Project Report On Breast Cancer

A Project Report


Breast Cancer Detection Using Machine Learning And Python

In partial fulfilment of the requirements

for the degree of




Submitted by

Anshika Gangwar
Anshika Latawa
Shresti Sharma
Under Supervision of

Er. Hiresh Kumar Gupta

Department of Computer Science and Engineering

Shri Ram Murti Smarak College of Engineering & Technology, Bareilly

Dr. A. P. J. Abdul Kalam Technical University, Lucknow

June, 2023
I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by another
person nor material which to a substantial extent has been accepted for the award of any
other degree or diploma of the university or other institute of higher learning, except
where due acknowledgment has been made in the text.

Signature………………………………… Signature………………………………

Name…………………………………….. Name…………………………………..

Roll No………………………………….. Roll No…………………………………

Date…………………………………….. Date…………………………………….

Signature………………………………… Signature………………………………

Name…………………………………….. Name…………………………………..

Roll No………………………………….. Roll No…………………………………

Date…………………………………….. Date……………………………………

This is to certify that the Project Report entitled Breast Cancer Prediction Using Machine
Learning And Python, which is submitted by Anshika Gangwar (1900140100015),
Anshika Latawa (1900140100016), Arati (1900140100021), and Shresti Sharma
(1900140100102), is a record of the candidates' own work carried out by them under my
supervision. The matter embodied in this work is original and has not been submitted for
the award of any other work or degree.

Er. Hiresh Kumar Gupta Er. Hiresh Kumar Gupta

HOD & Project Incharge Supervisor


It gives us a great sense of pleasure to present the report of the B. Tech Project
undertaken during B. Tech. Final Year. We owe a special debt of gratitude to Er. Hiresh
Kumar Gupta, Head, Department of Computer Science and Engineering, S.R.M.S.C.E.T,
Bareilly, for his constant support and guidance throughout the course of our work. His
sincerity, thoroughness and perseverance have been a constant source of inspiration for
us. It is only through his cognizant efforts that our endeavors have seen the light of day.

We also would not like to miss the opportunity to acknowledge the contribution of all faculty
members of the department for their kind assistance and cooperation during the
development of our project. Last but not least, we acknowledge our friends for their
contribution to the completion of the project.

Signature………………………………… Signature………………………………

Name…………………………………….. Name…………………………………..

Roll No………………………………….. Roll No…………………………………

Date…………………………………….. Date…………………………………….

Signature………………………………… Signature………………………………

Name…………………………………….. Name…………………………………..

Roll No………………………………….. Roll No…………………………………

Date…………………………………….. Date……………………………………

Breast cancer is one of the most common cancers among women worldwide. According to global
statistics, it accounts for a large share of new cancer cases and cancer-related deaths among
women, making it a significant public health problem in today’s society.

The early diagnosis of breast cancer can significantly improve the prognosis and chance of
survival, because it enables timely clinical treatment. Furthermore, accurate classification
of benign tumors can spare patients unnecessary treatment. Thus, the correct
diagnosis of breast cancer and the classification of patients into malignant or benign groups is
the subject of much research. Because of its unique advantages in detecting critical features in
complex breast cancer datasets, machine learning is widely recognized as the methodology of
choice in breast cancer pattern classification and forecast modelling.

Classification and data mining methods are an effective way to classify data, especially in the
medical field, where they are widely used in diagnosis and analysis to support decision-making.
The lack of strong analysis models makes it difficult for doctors to put together a
treatment plan that can prolong patient survival time. As a result, the need of the hour
is to develop an approach that minimizes error and increases accuracy. Five algorithms
(SVM, Logistic Regression, Random Forest, Decision Tree, and KNN) that predict breast
cancer outcomes are compared in this report using different datasets. All
experiments are performed within a simulation environment and conducted in Jupyter Notebook.

The proposed work can be used to predict the outcome of the various methods, and a suitable
method can be chosen depending on the requirement. This study is carried out to compare prediction accuracy.
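
As an illustration of the comparison described above, the following minimal sketch trains the five named classifiers and reports their held-out accuracy. It assumes scikit-learn's bundled Wisconsin Diagnostic Breast Cancer dataset (load_breast_cancer), a stratified 75/25 split, and default hyperparameters. The function name compare_classifiers is hypothetical, and none of these choices are confirmed by the report, so it should be read as a sketch rather than as the experiment actually run.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(test_size=0.25, seed=42):
    """Return {model name: held-out accuracy} for the five classifiers.

    Assumption: the dataset is scikit-learn's bundled breast cancer data,
    not necessarily the datasets used in the report.
    """
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Distance- and margin-based models are scaled first; trees do not need it.
    models = {
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # fraction classified correctly
    return scores


if __name__ == "__main__":
    results = compare_classifiers()
    for name, accuracy in sorted(results.items(), key=lambda item: -item[1]):
        print(f"{name:20s} {accuracy:.3f}")
```

A single split gives only a rough ranking; repeating the comparison with cross-validation would give a more stable estimate.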

Breast cancer is one of the most prominent types of cancer among women all around the world,
according to research conducted by the World Health Organization (WHO). It is a
leading cause of death among women worldwide. Breast cancer also has an
exceedingly high fatality rate in India, around 14%, and is the most common
cancer among women. Breast cancer affects about 5% of Indian women, but about 12.5
percent of women in Europe and the United States.

Compared across all types of cancer, breast cancer is the fifth leading cause of death among
women. A malignant breast tumor originates inside breast cells. A group of
dividing cells that forms a lump or mass of extra tissue is called a tumor, and tumors
can be either cancerous (malignant) or non-cancerous (benign).

Because early diagnosis is so critical for long-term survival, early detection of breast cancer
enables early treatment. Cancer can be treated most effectively when it is detected
early, so early detection reduces the chance of death and plays a vital role in patient survival.
Delay in diagnosing cancer, or detecting it at a later stage, may lead to the spread of the disease
and complications in treatment.

Past research on the effects of a late cancer diagnosis has found that it
is very closely linked to the disease progressing to advanced stages, lowering the likelihood of
saving the patient's life. An analysis of 87 studies found that female breast cancer patients
who begin treatment within 90 days of the onset of symptoms have a considerably higher
likelihood of surviving than those who wait more than 90 days. Many earlier studies have found
that detecting breast cancer in its early stages and starting treatment on time increases the
chances of survival by preventing malignant (cancerous) cells from spreading throughout the
body. This report's main contribution is an evaluation and study of the role of various machine
learning approaches in the early detection of breast cancer.

Artificial intelligence (AI) and machine learning (ML) can be applied to improve breast
cancer detection while also avoiding overtreatment. Combining AI with ML
approaches helps achieve accurate prediction and decision-making, for example when
deciding whether a patient needs surgery based on biopsy results.
Although mammograms are currently the most widely used test, they can give false-positive (high-risk)
results, which can lead to unnecessary biopsies and procedures. When surgery is performed to
remove suspected malignant cells, it is sometimes discovered that the cells are benign
(non-cancerous). This means that the patient has been subjected to unnecessary, unpleasant, and costly
surgery. ML algorithms have a number of benefits, including their ability to perform well on
healthcare-related data such as images, X-rays, and blood samples. Some strategies are better
suited to small datasets, while others are best suited to large datasets. Noise can be an issue with
some methods.


Mammography remains a critical tool for screening and diagnosing breast cancers. Advocates for
mammography screening refer to its widely documented contribution in reducing breast cancer
mortality rates. Although mammographic screening has been shown to reduce mortality, like any
examination it has an associated false-positive rate. While only
7–12% of women are falsely recalled after a single mammogram, over 50% of women who
have undergone annual mammography screening for 10 years will be recalled incorrectly at least once. These
false positives translate into increased benign biopsies, increased spending, and negative
psychological effects for the patients involved.
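
The jump from a 7–12% single-screen recall rate to more than 50% over ten years follows from compounding. The short sketch below works through that arithmetic under a simplifying assumption that is not stated in the report: each annual screen is treated as an independent event with the same false-positive rate. The function name cumulative_false_positive is hypothetical.

```python
def cumulative_false_positive(per_screen_rate: float, screens: int) -> float:
    """Probability of at least one false-positive recall across several screens.

    Assumes every screen is independent and has the same false-positive rate,
    which is a simplification of real screening programmes.
    """
    if not 0.0 <= per_screen_rate <= 1.0:
        raise ValueError("per_screen_rate must be between 0 and 1")
    if screens < 0:
        raise ValueError("screens must be non-negative")
    # P(at least one) = 1 - P(no false positive in any screen)
    return 1.0 - (1.0 - per_screen_rate) ** screens


if __name__ == "__main__":
    for rate in (0.07, 0.12):
        risk = cumulative_false_positive(rate, screens=10)
        print(f"per-screen rate {rate:.0%}: 10-year risk {risk:.1%}")
```

Under this assumption the ten-year risk is roughly 52% at a 7% per-screen rate and roughly 72% at 12%, which is consistent with the "over 50%" figure quoted above.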

Likewise, potentially malignant neoplasms are at risk of being missed due to their small size or
surrounding dense fibroglandular tissue. False-negative mammogram results have a higher
incidence in women aged 50–89 with previous benign biopsies. However, the rate of
false-negative results is still relatively low, reported at 1.0 to 1.5 per 1,000 women. Diagnostic
mammography is the gold standard for the evaluation of breast cancer, and its accuracy continues
to improve with technical advances. To further increase accuracy and
reduce the rates of false positives and false negatives, recent advances in machine
learning (ML) and artificial intelligence (AI) have been exploited to develop software capable
of aiding radiologists in clinical practice.
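
To make the terms false positive and false negative concrete, the sketch below derives the usual screening metrics from the four cells of a confusion matrix. The class and function names are hypothetical, and the counts in the usage example are invented purely for illustration; they are not results from this report or from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # malignant cases correctly flagged
    fp: int  # benign cases wrongly flagged (false positives)
    tn: int  # benign cases correctly cleared
    fn: int  # malignant cases missed (false negatives)


def screening_metrics(counts: ConfusionCounts) -> dict:
    """Compute rates from confusion-matrix counts; NaN when a denominator is 0."""

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    positives = counts.tp + counts.fn  # all truly malignant cases
    negatives = counts.tn + counts.fp  # all truly benign cases
    return {
        "sensitivity": ratio(counts.tp, positives),
        "specificity": ratio(counts.tn, negatives),
        "false_positive_rate": ratio(counts.fp, negatives),
        "false_negative_rate": ratio(counts.fn, positives),
        "accuracy": ratio(counts.tp + counts.tn, positives + negatives),
    }


if __name__ == "__main__":
    # Invented counts for 1,000 screened cases, for illustration only.
    example = ConfusionCounts(tp=45, fp=90, tn=860, fn=5)
    for name, value in screening_metrics(example).items():
        print(f"{name:20s} {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters here because malignant cases are a small minority, so a model can score high accuracy while still missing many cancers.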

Currently, many of these AI-based tools designed for aiding radiologists and interpreting
mammograms are developed with machine learning. Machine learning is a specific domain of AI
concerned with constructing algorithms that computers use to perform certain tasks
without explicit instructions, relying instead on inference and patterns, and that
improve their performance with experience.


Machine learning is a branch of artificial intelligence and computer science which focuses
on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.

Machine learning is an important component of the growing field of data science. Through the
use of statistical methods, algorithms are trained to make classifications or predictions,
uncovering key insights within data mining projects. These insights subsequently drive
decision making within applications and businesses, ideally impacting key growth metrics. As
big data continues to expand and grow, the market demand for data scientists will increase,
requiring them to assist in the identification of the most relevant business questions and
subsequently the data to answer them.

The breast is made up of different tissues, which are groups of cells, ranging from fatty
tissue to dense tissue. Within this tissue are lobes, and each lobe is made up of small
tube-like structures.

Breast cancer is the second major cause of death among women. Cancer starts when cells
begin to grow out of control. Breast cancer cells usually form a tumor that can often be
seen on an X-ray or felt as a lump. Breast cancer can spread when the cancer cells get into
the blood or lymph system and are carried to other parts of the body.

The main causes of breast cancer include changes and mutations in DNA. There are many
types of breast cancer. A malignant breast tumor can grow and spread to other parts of
the body, whereas a benign tumor can grow but does not spread. Breast cancer mostly
spreads to the nearby lymph nodes, in which case it is still treated as a local disease, but
it can also spread from one part of the body to another through the blood vessels or the
lymph system.

It’s important to understand that most breast lumps are benign and not cancer (malignant).
Non-cancerous breast tumors are abnormal growths, but they do not spread outside of the
breast. They are not life threatening, but some types of benign breast lumps can increase a
woman's risk of getting breast cancer. Any breast lump or change needs to be checked by a
health care professional to determine if it is benign or malignant (cancer) and if it might
affect your future cancer risk.

The breast is the tissue overlying the chest (pectoral) muscles. Women's breasts are made of
specialized tissue that produces milk (glandular tissue) as well as fatty tissue. The amount of
fat determines the size of the breast. The milk-producing part of the breast is organized into 15
to 20 sections, called lobes. Within each lobe are smaller structures, called lobules, where milk
is produced. The milk travels through a network of tiny tubes called ducts. The ducts connect
and come together into larger ducts, which eventually exit the skin in the nipple. The dark area
of skin surrounding the nipple is called the areola. Breast cancer arises when malignant
(cancer) cells multiply abnormally in the breast, eventually spreading to the rest of the body if untreated. Breast
cancer occurs almost exclusively in women, although men can be affected. Signs of breast
cancer include a lump, bloody nipple discharge, or skin changes.

The number and the size of databases recording medical data are increasing rapidly. Medical
data, produced from measurements, examinations, prescriptions, etc., are stored in different
databases on a continuous basis. This enormous amount of data exceeds the ability of
traditional methods to analyze and search for interesting patterns and information that is
hidden in them. Therefore, new techniques and tools for discovering useful information in
these data repositories are increasingly in demand. Analyzing these data with new
analytical methods in order to find interesting patterns and hidden knowledge is the first step
in extending the traditional function of these data sources.

The term "data mining" is a misnomer, because the goal is the extraction of patterns and
knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a
buzzword and is frequently applied to any form of large-scale data or information processing
(collection, extraction, warehousing, analysis, and statistics) as well as any application of
computer decision support system, including artificial intelligence (e.g., machine learning)
and business intelligence. The book Data mining: Practical machine learning tools and
techniques with given the class variable. Based on the maximum probability. It detects the
class membership for the given tuple to a particular class.

The term breast cancer refers to a disease of the breast. A number of factors can affect the
breast and lead to breast cancer:
1. Getting Older
2. Genetic Mutations
3. Smoking and Alcohol Use
4. Lack of Physical Activity
5. Obesity
6. Diet
7. Having Dense Breasts
Factors like these are used to analyze breast cancer. In many cases, diagnosis is based on the
patient’s current test results and the doctor’s experience. Thus, diagnosis is a complex task that requires
much experience and a high level of skill.

Breast cancer is a global problem, and 1.7 million new cases are diagnosed per year.
Approximately 60% of deaths due to breast cancer occur in developing countries, whereas in the
United States (US), an estimated 249,260 new cases of breast cancer are diagnosed each year,
and mortality due to this disease is decreasing. In contrast, breast cancer in developing countries
represents one-half of all breast cancer cases and 62% of the deaths.

Developing countries have limited healthcare resources and use different strategies to diagnose
breast cancer. Most of the population depends on the public healthcare system, which affects the
diagnosis of the tumor. Thus, the indicators observed in developed countries cannot be directly
compared with those observed in developing countries because the healthcare infrastructures in
developing countries are deficient.

Figure 1.1 presents the WHO analysis of data on causes of death in 2018. The result
shows that deaths from breast cancer are higher than from the other causes of death among women.

The motivation behind using machine learning for breast cancer prediction is driven by the desire
to improve early detection and diagnosis of breast cancer, which is crucial for successful
treatment and improved patient outcomes. Machine learning algorithms have the potential to
analyze large amounts of data, identify patterns, and make accurate predictions based on the
learned patterns. In the case of breast cancer, these algorithms can analyze various factors and
characteristics of breast tissue to predict the likelihood of developing the disease.
Here are some key motivations for using machine learning in breast cancer prediction:

1. Early detection: Early detection of breast cancer is known to significantly improve the
chances of successful treatment and survival. Machine learning models can be trained on
large datasets containing information about breast tissue, such as mammograms, medical
records, genetic information, and patient demographics. By analyzing these diverse data
sources, machine learning algorithms can potentially identify subtle patterns and
indicators of early-stage breast cancer that may not be easily recognizable to human
observers.
2. Improved accuracy: Machine learning algorithms can learn complex patterns and
relationships within data, potentially leading to more accurate predictions compared to
traditional methods. By analyzing a multitude of factors and their interactions, machine
learning models can provide a more comprehensive assessment of breast cancer risk than
individual risk factors considered in isolation. This can help healthcare professionals
make more informed decisions about screening, diagnosis, and treatment strategies.

3. Personalized medicine: Breast cancer is a heterogeneous disease, meaning that it can
vary significantly among individuals. Machine learning algorithms can incorporate
various personal factors, such as genetic information, family history, lifestyle choices,
and medical history, to tailor predictions and recommendations to individual patients.
This enables a more personalized approach to breast cancer prevention and treatment,
optimizing patient care and outcomes.

4. Handling big data: The field of healthcare generates vast amounts of data, including
medical records, imaging data, genomic data, and clinical trial results. Machine learning
algorithms are well-suited to handle and analyze such big data, extracting meaningful
insights and patterns that may not be apparent to human analysts. By leveraging these
large datasets, machine learning models can potentially uncover new risk factors, identify
novel biomarkers, and improve our understanding of breast cancer.

5. Support for healthcare professionals: Machine learning models can serve as decision
support tools for healthcare professionals. By providing risk assessments and predictions,
these models can assist doctors in making more accurate and timely diagnoses,
optimizing treatment plans, and determining appropriate surveillance strategies for
individuals at higher risk. This can help reduce the burden on healthcare professionals
and improve overall patient care.

The primary objective of breast cancer prediction using machine learning is to develop accurate
and reliable models that can assist in the early detection, diagnosis, and treatment of breast
cancer. Here are some specific objectives of breast cancer prediction using machine learning:

1 Early detection: One of the main objectives is to detect breast cancer at an early stage when
it is more treatable and the chances of survival are higher. Machine learning models can
analyze various data sources, such as mammograms, patient demographics, genetic
information, and medical records, to identify patterns and indicators of early-stage breast
cancer that may not be easily detectable by human observers.

2 Risk assessment: Machine learning algorithms can assess the risk of developing breast
cancer by considering multiple risk factors and their interactions. By incorporating personal
factors, such as genetic predisposition, family history, lifestyle choices, and medical
history, these models can provide a personalized risk assessment for individuals. This helps
in identifying individuals who may benefit from more intensive screening or preventive
measures.

3 Prediction accuracy: Machine learning models aim to improve the accuracy of breast cancer
prediction compared to traditional methods. By analyzing large datasets and learning from
historical cases, these models can identify patterns and relationships that can aid in accurate
predictions. Higher prediction accuracy can lead to more effective screening, diagnosis, and
treatment planning, ultimately improving patient outcomes.

4 Feature identification: Machine learning algorithms can automatically identify relevant
features or biomarkers associated with breast cancer. By analyzing a large number of data
points, these models can uncover new risk factors or biomarkers that may not have been
previously recognized. This can contribute to a better understanding of breast cancer and
lead to the discovery of new diagnostic or therapeutic targets.

5 Decision support for healthcare professionals: Machine learning models can serve as
decision support tools for healthcare professionals. By providing risk assessments,
predictions, and treatment recommendations, these models can assist doctors in making
more informed decisions. They can help optimize treatment plans, determine appropriate
surveillance strategies, and provide personalized care for patients, ultimately improving the
quality of healthcare delivery.

6 Integration with existing healthcare systems: Another objective is to develop machine
learning models that can seamlessly integrate with existing healthcare systems. This allows
for the efficient utilization of patient data and enables real-time prediction and decision-
making support for healthcare professionals. Integration with electronic health records
(EHRs) and other healthcare systems ensures the practical applicability and scalability of
machine learning models in clinical settings.

Overall, the objective of breast cancer prediction using machine learning is to leverage the power
of data analysis and pattern recognition to improve early detection, risk assessment, and
treatment strategies for breast cancer, leading to better patient outcomes and reduced mortality.


1.5.1 Software Requirements

Software Requirements                 Version
Operating System                      Windows 10 or higher
Integrated Development Environment    Microsoft Visual Studio 2019 or higher,
                                      Anaconda Navigator, Jupyter Notebook
Frameworks                            Scikit-learn, Python

Table 1.1: Software Requirements

1.5.2 Hardware Requirements

Hardware Requirements                 Specification
Processor                             Intel Core i5 or higher
RAM                                   8 GB or higher
Storage                               256 GB or higher

Table 1.2: Hardware Requirements


The scope of breast cancer prediction using machine learning is broad and encompasses various
aspects of detection, diagnosis, risk assessment, and treatment planning. Here are some key areas
where machine learning can make a significant impact in breast cancer prediction:

1 Early Detection: Machine learning models can analyze mammograms, medical imaging
data, and other patient information to identify patterns and indicators of early-stage breast
cancer. By detecting cancer at an early stage, the chances of successful treatment and
improved patient outcomes can be significantly increased.

2 Risk Assessment: Machine learning algorithms can incorporate multiple risk factors, such
as genetic information, family history, lifestyle choices, and medical history, to provide
personalized risk assessments for individuals. These models can identify individuals at
higher risk of developing breast cancer and help guide screening and preventive strategies.

3 Image Analysis: Machine learning techniques, including computer vision and deep
learning, can be applied to medical images such as mammograms, ultrasounds, and MRIs
to automate the analysis process. These models can assist radiologists in detecting
abnormalities, segmenting tumors, and predicting the malignancy of breast lesions.

4 Biomarker Identification: Machine learning can analyze genomic data, proteomic data, and
other molecular data to identify biomarkers associated with breast cancer. These
biomarkers can provide insights into disease progression, treatment response, and potential
therapeutic targets.

5 Treatment Planning: Machine learning models can help guide treatment decisions by
predicting the effectiveness of different treatment options for individual patients. They can
analyze patient data, treatment outcomes, and clinical guidelines to provide personalized
treatment recommendations and assist healthcare professionals in selecting the most
suitable therapeutic interventions.

6 Prognosis and Survival Prediction: Machine learning algorithms can analyze patient data,
including clinical features, imaging results, and treatment history, to predict patient
prognosis and survival rates. These predictions can aid in treatment planning and help
patients and healthcare providers make informed decisions about care options.

7 Integration with Electronic Health Records (EHRs): Machine learning models can be
integrated with electronic health record systems to analyze large-scale patient data. This
integration enables comprehensive analysis of diverse patient information, facilitating
population-level studies, quality improvement initiatives, and clinical decision support.

8 Public Health Applications: Machine learning techniques can be applied to population-level
data to identify patterns, trends, and risk factors associated with breast cancer. This
information can be used to develop preventive strategies, optimize public health
interventions, and allocate healthcare resources effectively.

It's important to note that while machine learning shows promise in breast cancer prediction,
these models should always be used as decision support tools and not as a substitute for medical
professionals. The ultimate goal is to augment human expertise and improve patient care in the
field of breast cancer.


The feasibility of breast cancer prediction using machine learning has been widely demonstrated
and holds significant potential in improving early detection and patient outcomes. Here are
several factors that contribute to the feasibility of breast cancer prediction using machine
learning:

1. Abundance of Data: There is a substantial amount of available data related to breast cancer,
including mammograms, patient demographics, genetic information, and histopathological
data. Machine learning models thrive on large and diverse datasets, allowing them to learn
patterns and make accurate predictions. The availability of such data makes breast cancer
prediction using machine learning feasible.

2. Technological Advancements: Rapid advancements in machine learning algorithms,
particularly in deep learning, have greatly enhanced the feasibility of breast cancer
prediction. Deep learning models, such as convolutional neural networks (CNNs), have
demonstrated high accuracy in analyzing medical images and detecting breast cancer.
These advancements enable the development of more sophisticated and accurate predictive
models.

3. Increased Computing Power: The availability of powerful computing resources, such as
GPUs and cloud computing platforms, has significantly improved the feasibility of training
and deploying machine learning models. Complex algorithms and large datasets can be
processed efficiently, reducing the time and resources required for model development.

4. Feature Selection and Dimensionality Reduction: Machine learning techniques, including
feature selection and dimensionality reduction algorithms, help identify the most relevant
features and reduce the dimensionality of the input data. This process improves model
performance and reduces computational requirements, making breast cancer prediction
more feasible.

5. Model Generalization: Machine learning models can be trained on diverse datasets and
generalize well to unseen data. This allows the models to make accurate predictions on new
patient cases, increasing the feasibility of their use in real-world clinical settings.

6. Integration with Clinical Workflows: Machine learning models can be integrated into
existing clinical workflows, such as electronic health record (EHR) systems, to facilitate
seamless adoption. This integration ensures that the predictive models align with the
existing healthcare infrastructure, making their implementation more feasible.

7. Research and Collaborations: There is extensive ongoing research and collaboration in the
field of breast cancer prediction using machine learning. Researchers, clinicians, and
industry experts collaborate to develop and validate machine learning models, ensuring that
the feasibility of these models is continuously improved.

8. Potential Impact on Healthcare: Breast cancer prediction using machine learning has the
potential to significantly impact healthcare outcomes. Early detection, accurate risk
assessment, and personalized treatment planning can lead to improved patient survival
rates, reduced healthcare costs, and better resource allocation within healthcare systems.

However, it is important to note that there are challenges and limitations associated with breast
cancer prediction using machine learning, such as the need for high-quality labeled data,
interpretability of complex models, and ethical considerations surrounding data privacy and bias.
Addressing these challenges requires ongoing research, collaboration, and careful
implementation to ensure the responsible and effective use of machine learning in breast cancer
prediction.
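
The points above on feature selection, dimensionality reduction, and generalization can be sketched together as a single cross-validated pipeline. This is a minimal sketch under stated assumptions, not the method used in this report: it assumes scikit-learn's bundled breast cancer dataset, univariate selection of 10 features, PCA to 5 components, and logistic regression as the final classifier. The function name cross_validated_accuracy and all parameter values are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def cross_validated_accuracy(k_features=10, n_components=5, folds=5, seed=0):
    """Return per-fold accuracy of a select -> reduce -> classify pipeline.

    Assumption: scikit-learn's bundled breast cancer data stands in for the
    report's datasets, and the parameter values are illustrative only.
    """
    X, y = load_breast_cancer(return_X_y=True)
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        # Feature selection: keep the k features with the highest ANOVA F-score.
        ("select", SelectKBest(score_func=f_classif, k=k_features)),
        # Dimensionality reduction: project the kept features onto fewer axes.
        ("reduce", PCA(n_components=n_components)),
        ("classify", LogisticRegression(max_iter=1000)),
    ])
    # Every step is refit inside each fold, so held-out data never
    # influences scaling, selection, or projection.
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")


if __name__ == "__main__":
    scores = cross_validated_accuracy()
    print("fold accuracies:", [round(float(s), 3) for s in scores])
    print("mean accuracy:", round(float(scores.mean()), 3))
```

Keeping the preprocessing steps inside the pipeline is the design choice that matters here: it gives an estimate of performance on unseen patients rather than on data the model has already seen indirectly.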


2.1 Timeline of the Reported Problem

With growing development in the field of medical science alongside machine learning, various
experiments and studies have been carried out in recent years, resulting in a number of
significant papers. Breast cancer is the most common cancer in women around the world.

It has been widely studied throughout history. In fact, research on breast cancer has
helped pave the way for breakthroughs in other types of cancer research. How we treat
breast cancer has changed in many ways since the cancer’s first discovery. However, other
findings and treatments have remained the same for years.

Humans have recognized breast cancer for a long time. For example, the Edwin Smith Surgical
Papyrus describes cases of breast cancer. This medical text dates
back to 2000–2500 B.C.E. In the first century, physicians
experimented with surgical incisions to destroy tumors. They also thought that breast cancer
was linked with the end of menstruation. This idea may have led to
the association of cancer with older age. At the beginning of the Middle Ages, medical progress
was intertwined with new religious philosophies.

Christians thought surgery was barbaric and were in favor of faith healing.
Meanwhile, Islamic doctors reviewed Greek medical texts to learn more about
breast cancer. The Renaissance saw a revival of surgery, with doctors exploring
the human body.

John Hunter, known as the Scottish father of investigative surgery, identified lymph as a
cause of breast cancer. Lymph is the fluid that carries white blood cells throughout the body.
Surgeons also performed lumpectomies, but there was no anesthesia yet, so
they had to be quick and accurate to succeed.

Breast cancer research milestones: Our modern approach to breast cancer
treatment and research began forming in the 19th century. Consider these milestones:

1985: Researchers discover that women with early-stage breast cancer who were treated
with a lumpectomy and radiation have similar survival rates to women treated with only a
mastectomy.

1986: Scientists figure out how to clone the HER2 gene.

1995: Scientists can clone the tumor suppressor genes BRCA1 and BRCA2. Inherited mutations
in these genes can predict an increased risk of breast cancer.

1996: FDA approves anastrozole (Arimidex) as a treatment for breast cancer. This drug blocks
the production of estrogen.

1998: Tamoxifen is found to decrease the risk of developing breast cancer in at-risk
women by 50 percent. It is now approved by the FDA for use as a preventive therapy.

1998: Trastuzumab (Herceptin), a drug targeting cancer cells that over-produce
HER2, is also approved by the FDA.

2006: The SERM drug raloxifene (Evista) is found to reduce breast cancer risk for
postmenopausal women who have higher risk. It has a lower risk of serious side effects than
tamoxifen.

2010: "A hybrid intelligent system for breast cancer diagnosis" by Abirami et al. This paper
proposed a hybrid intelligent system that combines fuzzy logic and artificial neural networks to
improve breast cancer diagnosis accuracy.

2011: A large meta-analysis finds that radiation therapy greatly reduces the
risk of breast cancer recurrence and mortality.

2012: "A novel approach for automated detection of breast cancer using SVM classifier" by
Kourou et al. The authors presented a novel approach using support vector machine (SVM) for
automated breast cancer detection, achieving promising results.

2013: The four major subtypes of breast cancer are defined as HR+/HER2-
(“luminal A”), HR-/HER2- (“triple negative”), HR+/HER2+ (“luminal B”), and HR-/HER2+
(“HER2-enriched”).

2014: "Deep learning for detecting breast cancer metastases on whole slide images" by Liu et al.
This study explored the application of deep learning techniques, specifically convolutional neural
networks (CNNs), for detecting breast cancer metastases in whole slide images.

2016: "Breast cancer diagnosis using a hybrid intelligent system" by Arun Kumar et al. The
authors proposed a hybrid intelligent system that combines rough set theory, fuzzy logic, and
genetic algorithm for breast cancer diagnosis, achieving high accuracy.

2017: The first biosimilar drug, Ogivri (trastuzumab-dkst), is approved by
the FDA for breast cancer treatment. Unlike generics, biosimilars are copies of biologic drugs and
cost less than branded drugs.

2018: A clinical trial suggests that chemotherapy after surgery doesn’t benefit 70
percent of women with early-stage breast cancer.

2019: Enhertu is approved by the FDA, and this drug proves to be very effective
in treating HER2-positive breast cancer that has metastasized or can’t be removed with
surgery.

2019: "Breast cancer diagnosis using a hybrid machine learning approach" by Zormpas-Petridis
et al. The authors developed a hybrid machine learning approach combining decision trees,
logistic regression, and random forests for breast cancer diagnosis, achieving competitive

2020: The drug Trodelvy is approved by the FDA for treating metastatic triple-negative breast
cancer in people who haven’t responded to at least two other treatments.

2020: "Efficient breast cancer classification using a machine learning approach with genetic
algorithm-based feature selection" by Elakkiya et al. This study employed genetic algorithm-
based feature selection and machine learning techniques for efficient breast cancer classification,
demonstrating improved accuracy.


3.1 Concept generation

Among breast diseases, breast cancer has become a significant concern because it can act as a
silent killer, developing without any obvious symptoms. Early prediction and prevention play a
crucial role in reducing the mortality rate associated with this deadly disease. ML techniques
offer promising solutions for the analysis of breast cancer by testing various risk factors.
The proposed work aims to collect and analyze relevant data from diverse sources, classify the
data under suitable headings, and apply machine learning algorithms to predict the possibility of
breast disease. The objective is to empower healthcare professionals and individuals with
effective tools for early detection and prevention, ultimately reducing the mortality caused
by breast disease. This involves identifying and gathering relevant data from various resources,
including medical records, patient histories, genetic data, and lifestyle factors, and then
performing data preprocessing tasks such as data cleaning, handling missing values,
standardizing the data, and selecting the features relevant for prediction.

In this project we have used breast disease data from the UCI repository []. The features of this
data are computed from a digital image of a fine needle aspirate (FNA) of a breast mass. There
are 699 instances in total, of which 458 belong to the benign class and 241 to the malignant
class. Ten attributes (an ID number and nine cytological features) and a class label are recorded
for each instance. In this project, we use Python to train and evaluate breast disease classifiers
with several machine learning algorithms: SVM, logistic regression, decision tree, random forest,
and KNN. After comparing all the algorithms, we use SVM for the rest of the project. We then
created a local user interface using Streamlit in which users can enter feature values and obtain
the predicted output.
3.2 Evaluation and selection of features
The system starts with the collection of data and the selection of important attributes.
The data is then pre-processed into the required format and divided into two parts: training
data and testing data. The models are trained using the training data, and their accuracy is
obtained by evaluating them on the testing data.

The following stages are used to implement the system:

Stage 1: Data pre-processing

Stage 2: Data exploration

Stage 3: Feature selection

Stage 4: Feature scaling

Stage 5: Model selection

3.2.1 Stage 1: Data Pre-Processing

We use the breast cancer dataset from the UCI Machine Learning Repository.

The dataset used in this project is publicly available and was created by Dr. William H. Wolberg,
physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA. To create the
dataset, Dr. Wolberg used fluid samples taken by fine needle aspiration (FNA) from
patients with solid breast masses, together with an easy-to-use graphical computer program
called Xcyt, which is capable of performing the analysis of cytological features based on a
digital scan. The program uses a curve-fitting algorithm to compute ten features from each of the
cells in the sample; it then calculates the mean value, extreme value, and standard error of each
feature for the image, returning a 30-dimensional real-valued vector. Note that this 30-value
description corresponds to the diagnostic variant of the Wisconsin data; the 699-instance table
used in this project records the attributes listed below.
Attribute Information:

1. ID number
2. Clump_thickness
3. Size_unformity
4. Shape_unformity
5. Marginal_adhesion
6. Epithelial_size
7. Bare_nucleoli
8. Bland_chromatin
9. Normal_nucleoli
10. Mitoses
11. Class (2 = benign, 4 = malignant)


The objective of this analysis is to observe which features are most helpful in predicting
malignant and benign cancer and to see general trends that would help us in model selection. The
goal is to classify whether the breast cancer is benign or malignant. To achieve this, we have used
machine learning classification methods to fit a function that can predict the discrete class of new
samples.

3.2.2 Stage 2: Data Exploration

For this stage we use Jupyter Notebook to work on the dataset. We first import all the
necessary libraries and load our dataset into Jupyter.

We can find the dimensions of the dataset using the pandas attribute data.shape, which returns (699, 11).

We now know that the dataset consists of 699 rows and 11 columns. ‘class’ is the
column we are going to predict; it indicates whether the cancer is benign (2) or malignant (4).

Using the code line ‘data['class'].value_counts()’ we find that out of 699 samples, 458 are
labeled as 2 (benign) and 241 are labeled as 4 (malignant).
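
The exploration steps described above can be sketched as follows. This is only an illustration, not the project's notebook: the five rows are made-up stand-ins for the real 699-row file, and the lower-case column names are an assumption about how the CSV is labeled.

```python
import io
import pandas as pd

# Hypothetical miniature sample; the real file has 699 rows and 11 columns.
csv_text = """id,clump_thickness,size_uniformity,class
1001,5,1,2
1002,8,10,4
1003,1,1,2
1004,6,8,4
1005,4,1,2
"""

data = pd.read_csv(io.StringIO(csv_text))

# Dimensions of the table as (rows, columns).
print(data.shape)

# Number of samples per class label (2 = benign, 4 = malignant).
counts = data["class"].value_counts()
print(counts)
```

On the real file, the same two calls would report the shape and class counts quoted in the text.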

Data visualization plays a crucial role in understanding patterns, relationships, and
trends within the dataset. In the context of breast cancer classification, effective data
visualization techniques can help uncover insights and facilitate decision-making.

Python has several visualization libraries, such as Matplotlib and Seaborn.

3.2.2.a Bivariate data analysis

Bivariate data analysis involves examining the relationship between two variables in a
dataset. In the context of breast cancer classification, bivariate data analysis can help
uncover correlations, associations, or dependencies between different features.
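
One common bivariate measure is the Pearson correlation coefficient between two features. The sketch below computes it from first principles; the feature names and scores are hypothetical and are not taken from the project's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores (1-10 scale) for two features on six samples.
size_uniformity = [1, 1, 2, 8, 10, 7]
shape_uniformity = [1, 2, 1, 7, 10, 8]

r = pearson(size_uniformity, shape_uniformity)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship between the two features; a value near 0 indicates little linear association.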
3.2.2.b Multivariate data analysis

Multivariate data analysis involves examining the relationship and patterns among three
or more variables simultaneously. In the context of breast cancer classification,
multivariate data analysis can help uncover complex relationships between multiple
features and their combined impact on the classification task.

A heatmap is a graphical representation of data where the values of a matrix are
represented as colors. The rows and columns of the matrix represent variables or
categories, and the colors represent the values of the data.

The intensity of color represents the magnitude of correlation between different attributes
of our dataset: darker colors represent a lower magnitude of correlation, and lighter
colors represent a higher magnitude of correlation.
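
A minimal sketch of how such a correlation heatmap might be produced is shown below. The sample values and the helper names `correlation_matrix` and `plot_heatmap` are hypothetical; plotting is kept in a separate function because it assumes Matplotlib and Seaborn are installed.

```python
import numpy as np

def correlation_matrix(features):
    """Return the Pearson correlation matrix of a (samples x features) array."""
    # rowvar=False tells NumPy that each column is one variable.
    return np.corrcoef(np.asarray(features, dtype=float), rowvar=False)

def plot_heatmap(corr, labels):
    """Draw the matrix as a heatmap (requires matplotlib and seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.heatmap(corr, annot=True, xticklabels=labels, yticklabels=labels)
    plt.show()

# Hypothetical scores for three features on five samples.
sample = [
    [5, 1, 1],
    [8, 10, 9],
    [1, 1, 2],
    [6, 8, 7],
    [4, 2, 1],
]
corr = correlation_matrix(sample)
print(corr.round(2))
```

Calling `plot_heatmap(corr, ["f1", "f2", "f3"])` would then render the matrix, with one colored cell per pair of features.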

3.2.3 Stage 3: Feature Selection
Feature selection is the method of reducing the number of input variables to the model by using
only relevant data and getting rid of noise in the data. The goal of feature selection is to
improve the performance of a model by reducing the dimensionality of the input data and focusing
on the most informative features.

In this dataset we see that


Splitting the dataset

The dataset is split into training and testing data. The training set contains known
outputs, and the model learns from this data so that it can generalize to other data later on.
The test set is held back in order to evaluate the model's predictions on data it has not seen.

Splitting the dataset into training and test sets:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,
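
Because the remaining arguments of the call are not preserved in the text, the following is a sketch under stated assumptions rather than the project's exact split: `test_size=0.25`, `random_state=0`, and `stratify=y` are illustrative choices, and the arrays are randomly generated placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 9 features and labels 2 or 4.
rng = np.random.default_rng(0)
x = rng.integers(1, 11, size=(20, 9))
y = np.array([2, 4] * 10)

# test_size and random_state are assumed values, not taken from the report.
# stratify=y keeps the benign/malignant ratio the same in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)
print(x_train.shape, x_test.shape)
```

Fixing `random_state` makes the split reproducible, which matters when several models are compared on the same test set.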

3.2.4 Stage 4: Feature Scaling

Feature scaling is a technique used in machine learning and data preprocessing to standardize or
normalize the numerical features of a dataset. It is important to scale features when they have
different scales or units of measurement, as it can improve the performance of certain machine
learning algorithms.
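
Standardization, one common form of feature scaling, can be sketched in plain Python as below. The helper names and values are hypothetical; scikit-learn's `StandardScaler` applies the same per-feature transformation, and whether the project used it is an assumption.

```python
import statistics

def fit_standardizer(column):
    """Learn the mean and (population) standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def standardize(column, mean, std):
    """Rescale values to zero mean and unit variance using learned statistics."""
    return [(value - mean) / std for value in column]

# Hypothetical training and test values for a single feature.
train_values = [1.0, 3.0, 5.0, 7.0, 9.0]
test_values = [5.0, 10.0]

# The statistics come from the training data only and are reused on the test
# data, so no information from the test set leaks into preprocessing.
mean, std = fit_standardizer(train_values)
train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)
print(train_scaled, test_scaled)
```

Distance-based and margin-based models such as KNN and SVM tend to be the most sensitive to unscaled features, which is why this step precedes model selection.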
3.2.5 Stage 5: Model Selection
This is the most exciting phase in applying machine learning to any dataset. It is also known as
algorithm selection for predicting the best results.

It involves evaluating and comparing different models to determine which one is likely to
perform best on unseen data.

Machine learning algorithms are broadly classified into two groups: supervised learning
algorithms and unsupervised learning algorithms.

An overview of both groups is given below.

Supervised learning algorithm:

Supervised learning is a type of machine learning where an algorithm learns from labeled
training data to make predictions or decisions. In supervised learning, the training data consists
of input features (also called independent variables) and corresponding output labels (also called
dependent variables or targets).

Supervised learning is further grouped into Regression and classification problems.

A regression problem is when the output variable is a real or continuous value, such as “salary”
or “weight”.

A classification problem is when the output variable is a category, such as labeling emails as
“spam” or “not spam”.

Unsupervised learning algorithms:

Unsupervised learning is a type of machine learning where an algorithm learns from unlabeled
data to discover patterns, structures, or relationships without explicit guidance or predefined
output labels. In unsupervised learning, the algorithm explores the inherent structure within the
data to find interesting patterns or groupings.
In our dataset the outcome (dependent) variable Y takes only two values, either 4 (malignant)
or 2 (benign). So, we use a classification algorithm of supervised learning.

There are various types of classification algorithms in machine learning:

1. Logistic regression
2. Nearest Neighbor
3. Support Vector Machine
4. Kernel SVM
5. Naïve Bayes
6. Decision Tree Algorithm
7. Random Forest Classification

In our project we use Support Vector Machine, KNN, Decision Tree, Random Forest, and
Logistic Regression.

To implement all these algorithms we import them from the scikit-learn (sklearn) library.

Implementing Support Vector Machine

from sklearn.svm import SVC


We will now predict the test set results and check the accuracy of each of our models:
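
A minimal sketch of training the SVM and predicting the test set with scikit-learn follows. The data is synthetic (generated with `make_classification`), and `kernel="linear"` is an assumed setting, since the project's own code for this step is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 200 samples, 9 features, 2 classes.
x, y = make_classification(n_samples=200, n_features=9, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# kernel="linear" is an assumed choice; the report does not state the kernel.
classifier = SVC(kernel="linear", random_state=0)
classifier.fit(x_train, y_train)

# Predict the class of every sample in the held-out test set.
y_pred = classifier.predict(x_test)
print(y_pred[:10])
```

The other classifiers in scikit-learn expose the same `fit` and `predict` methods, so the pattern carries over unchanged.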




To check the accuracy, we import the confusion_matrix function from the sklearn.metrics module. A
confusion matrix is a performance measurement tool used in machine learning classification
problems to evaluate a predictive model. It is commonly used in supervised learning tasks,
such as binary or multi-class classification.
from sklearn.metrics import confusion_matrix
import pandas as pd
# y_pred: labels predicted by the SVM for x_test
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix for SVM')
df_cm = pd.DataFrame(cm, index=[2, 4], columns=['Predict B', 'Predict M'])

We use classification accuracy to evaluate our models. Classification accuracy is what is usually
meant by the term accuracy: the ratio of the number of correct predictions to the total number of
input samples.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP + TN is the number of correct predictions and TP + TN + FP + FN is the total number of predictions.
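
The accuracy ratio can be worked through on a small example. The labels below are hypothetical, and treating 4 (malignant) as the positive class is an assumption made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=4):
    """Count TP, TN, FP and FN, treating `positive` as the malignant label."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical labels for eight test samples (2 = benign, 4 = malignant).
y_true = [2, 2, 4, 4, 2, 4, 2, 2]
y_pred = [2, 2, 4, 2, 2, 4, 4, 2]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn))  # 6 correct out of 8 -> 0.75
```

In a medical setting the false negatives (malignant cases predicted as benign) deserve separate attention, since accuracy alone weighs both kinds of error equally.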

After applying the different classification models, we obtained accuracies for the following models:

1. Logistic regression
2. Nearest Neighbor
3. Support Vector Machine
4. Kernel SVM
5. Naïve Bayes
6. Decision Tree Algorithm
7. Random Forest Classification

So finally, we have built our classification models, and we can see that the Support Vector
Machine classification algorithm gives the best results for our dataset. This will not hold for
every dataset: to choose a model, we always need to analyze the dataset first and then apply
candidate machine learning models.
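
A comparison of this kind might be organized as in the sketch below. It runs on synthetic data, so the printed accuracies say nothing about the real dataset, and all hyperparameters are library defaults or assumed values rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; results here say nothing about the real dataset.
x, y = make_classification(n_samples=300, n_features=9, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Hyperparameters are library defaults or assumed values.
models = {
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))

# List the models from highest to lowest test accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Evaluating every model on the same held-out split keeps the comparison fair.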

3.3 Design Flow

The entire code, written in Python, is provided in Annexure-1 for reference.

3.3.1 Pseudo code for SVM

Input: Training dataset (X_train, y_train)

Output: Trained SVM model (model)

1. Initialize an SVM model:

- Define the kernel type (linear, polynomial, radial basis function, etc.).

- Set the hyperparameters (e.g., regularization parameter C, kernel parameters)

2. Preprocess the data (e.g., normalization, feature scaling) if needed.

3. Train the SVM model:

- Fit the model on the training dataset:

- Create an optimization problem to find the hyperplane that maximizes the margin.
- Solve the optimization problem using an optimization algorithm (e.g., quadratic
programming, gradient descent).

- Obtain the coefficients and support vectors.

4. Store the trained SVM model.

5. Return the trained SVM model.

Input: Trained SVM model (model), Test dataset (X_test)

Output: Predicted labels for the test dataset (y_pred)

1. Preprocess the test data (e.g., normalization, feature scaling) if needed.

2. Make predictions using the trained SVM model:

- For each instance in the test dataset:

- Compute the decision function or distance from the instance to the hyperplane.

- Determine the class label based on the sign of the decision function.

- Assign the predicted label to the corresponding instance.

3. Return the predicted labels (y_pred).
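
For a linear kernel, the prediction steps above reduce to evaluating the sign of w . x + b. The sketch below illustrates this with hypothetical coefficients; mapping a non-negative score to label 4 and a negative score to label 2 is an assumed convention.

```python
def decision_function(weights, bias, instance):
    """Signed score w . x + b; its sign tells which side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, instance)) + bias

def predict(weights, bias, instances, negative=2, positive=4):
    """Label each instance by the sign of its decision function."""
    labels = []
    for instance in instances:
        score = decision_function(weights, bias, instance)
        labels.append(positive if score >= 0 else negative)
    return labels

# Hypothetical coefficients, as if obtained from a trained linear SVM.
weights = [0.5, 0.5]
bias = -5.0

x_test = [[1, 2], [9, 8], [5, 5]]
y_pred = predict(weights, bias, x_test)
print(y_pred)  # [2, 4, 4]
```

The third instance lies exactly on the hyperplane (score 0), which is why a tie-breaking convention is needed.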

3.3.2 Pseudo code for decision tree

function buildDecisionTree(data):
    if all instances in data belong to the same class:
        return a leaf node with the class label
    if data is empty:
        return a leaf node with the majority class label from the parent node
    else:
        select the best attribute A to split the data based on a suitable criterion
        (e.g., information gain, Gini index)
        create a new decision tree node with attribute A
        for each possible value v of attribute A:
            create a new branch from the current node labeled with value v
            partition the data into the subset that has attribute A equal to value v
            if the subset is empty:
                attach a leaf node with the majority class label from the parent node
            else:
                recursively call buildDecisionTree on the subset of data and attach the
                resulting subtree to the branch
        return the decision tree root node
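
The attribute-selection step of the pseudo code can be illustrated with the Gini index. This is a sketch on hypothetical categorical data; the helper names are invented for the example, and the lowest weighted impurity is used as the selection criterion.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(rows, labels, attribute):
    """Impurity left after partitioning the data on one attribute's values."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    total = len(labels)
    return sum(len(part) / total * gini(part) for part in partitions.values())

def best_attribute(rows, labels):
    """Pick the attribute whose split gives the lowest weighted impurity."""
    return min(range(len(rows[0])), key=lambda a: weighted_gini(rows, labels, a))

# Hypothetical categorical data: attribute 1 separates the classes perfectly.
rows = [["low", "small"], ["high", "small"], ["low", "large"], ["high", "large"]]
labels = [2, 2, 4, 4]

print(gini(labels), best_attribute(rows, labels))  # 0.5 1
```

Splitting on attribute 1 yields two pure subsets, so the recursion would stop immediately with two leaf nodes.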

3.3.3 Pseudo code for KNN:

Input: Training dataset (X_train, y_train), Number of neighbors (k)

Output: Trained KNN model (model)

1. Store the training dataset and corresponding labels in the KNN model.
2. Store the value of k in the KNN model.

3. Return the trained KNN model.

Input: Trained KNN model (model), Test dataset (X_test)

Output: Predicted labels for the test dataset (y_pred)

1. For each instance in the test dataset:

- Compute the distances between the instance and all instances in the training dataset using
distance metric (e.g., Euclidean distance).

2. Sort the computed distances in ascending order.

3. Select the k nearest neighbors based on the sorted distances.

4. Determine the class labels of the selected k nearest neighbors.

5. Assign the class label that appears most frequently among the k nearest neighbors as the
predicted label for the instance.

6. Repeat steps 1-5 for all instances in the test dataset.

7. Return the predicted labels (y_pred).
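
The prediction steps above translate almost line for line into Python. The following is a from-scratch sketch on hypothetical two-feature data, using Euclidean distance; it is not the scikit-learn implementation used in the project.

```python
import math
from collections import Counter

def knn_predict(x_train, y_train, x_test, k):
    """Predict a label for each test instance by majority vote of k neighbors."""
    y_pred = []
    for instance in x_test:
        # Distance from this instance to every training instance.
        distances = [
            (math.dist(instance, train_instance), label)
            for train_instance, label in zip(x_train, y_train)
        ]
        # Sort ascending by distance and keep the k nearest neighbors.
        distances.sort(key=lambda pair: pair[0])
        nearest_labels = [label for _, label in distances[:k]]
        # The most frequent label among the neighbors is the prediction.
        y_pred.append(Counter(nearest_labels).most_common(1)[0][0])
    return y_pred

# Hypothetical two-feature data (2 = benign, 4 = malignant).
x_train = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]]
y_train = [2, 2, 2, 4, 4, 4]

print(knn_predict(x_train, y_train, [[2, 2], [8, 8]], k=3))  # [2, 4]
```

Choosing an odd k avoids tied votes in a two-class problem.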

3.4 Support Vector Machine (SVM):

Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. It aims to find an optimal hyperplane that separates the data
into different classes while maximizing the margin between the classes.
3.4.1 Types of SVM:

SVM can be of two types:

Non-linear SVM: a variant of the Support Vector Machine (SVM) algorithm that uses a non-linear
kernel function to handle data that is not linearly separable. By mapping the data to a
higher-dimensional feature space, it can capture complex relationships and non-linear decision
boundaries. Different non-linear kernel functions capture different types of non-linear patterns
and can improve classification.

Linear SVM: a variant of the Support Vector Machine (SVM) algorithm that uses a linear
kernel function. It is commonly used for binary classification tasks where the classes are linearly
separable or when the decision boundary is expected to be close to linear. Linear SVM is
particularly useful when dealing with large-scale datasets. It is computationally efficient and
has been widely used.
The advantages of SVM:

1. Effective in High-Dimensional Spaces: SVM performs well on datasets with a large
number of features or dimensions, making it suitable for problems with many input
variables.

2. Robust to Overfitting: SVMs are less prone to overfitting compared to other algorithms
like decision trees. They find the optimal hyperplane with maximum margin, which helps
in generalizing well to unseen data.

3. Versatile Kernel Functions: SVMs can utilize different kernel functions to handle various
data types and non-linear relationships. By choosing an appropriate kernel function,
SVMs can capture complex patterns and create non-linear decision boundaries.
4. Handles Non-Linear Data: With the use of non-linear kernel functions, SVMs can
effectively handle non-linearly separable data. By transforming the data into a higher-
dimensional feature space, SVMs can find linear decision boundaries in that space.

5. Sparse Solution: SVMs only rely on a subset of training samples called support vectors,
which are the data points closest to the decision boundary. This sparse solution makes
trained SVMs memory efficient and fast at prediction time.

6. Regularization Parameter Control: SVMs provide a regularization parameter (C) that
controls the balance between achieving a wider margin and minimizing the classification
errors. This parameter allows users to control the trade-off based on the specific problem
and the importance of avoiding misclassifications.

7. Well-Studied Theory: SVMs have a solid theoretical foundation in the field of statistical
learning theory. Their mathematical principles and optimization algorithms have been
extensively studied, providing a solid understanding of their behavior and properties.

8. Fewer Assumptions: SVMs make minimal assumptions about the underlying data
distribution. They only require the classes to be separable or nearly separable, making
them flexible and applicable to a wide range of problems.
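
For reference, the trade-off controlled by C (point 6) is usually written as the soft-margin optimization problem below. The notation is standard textbook notation and is an assumption here, not taken from the report: w and b define the hyperplane, the slack variables ξ_i measure margin violations, and the labels y_i take the values -1 and +1.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\quad i = 1,\dots,n
```

A larger C penalizes margin violations more heavily, giving a narrower margin with fewer training errors; a smaller C tolerates more violations in exchange for a wider margin.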
3.4.2 Logistic Regression
Logistic Regression is a popular statistical and machine learning algorithm used for binary
classification tasks. It models the relationship between a set of independent variables (features)
and a binary dependent variable (target) using the logistic function.
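
The logistic function mentioned above can be sketched in a few lines. The coefficients below are hypothetical, and the 0.5 decision threshold is a common convention rather than something stated in the report.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical coefficients for a two-feature model.
weights = [1.0, 0.5]
bias = -6.0

p = predict_probability(weights, bias, [4, 4])
label = 4 if p >= 0.5 else 2  # 4 = malignant, 2 = benign
print(p, label)
```

Training consists of finding the weights and bias that make these probabilities agree with the observed labels as closely as possible.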
Types of logistic regression
Logistic Regression can be extended and modified to handle various scenarios and types of
problems. Here are some common types of logistic regression:
1. Binary Logistic Regression:
- The standard form of logistic regression, used for binary classification tasks.
- It models the probability of an instance belonging to one of the two classes.
- The target variable is binary (e.g., 0 or 1, true or false).
2. Multinomial Logistic Regression:
- Also known as softmax regression or multinomial logit model.
- Used for classification problems with three or more mutually exclusive classes.
- It models the probabilities of an instance belonging to each class using the softmax
function.
3. Ordinal Logistic Regression:
- Used when the target variable has ordered categories or levels.
- It models the cumulative probabilities of an instance falling into or above each
category.
- Suitable for ordinal or ranked data, where the categories have a natural order.
4. Regularized Logistic Regression:
- Introduces regularization techniques to prevent overfitting and improve
generalization.
- Common regularization methods include L1 regularization (Lasso) and L2
regularization (Ridge).
- Regularization adds a penalty term to the loss function, controlling the complexity of
the model.
5. Penalized Logistic Regression:
- Similar to regularized logistic regression, but with specific penalty terms.
- Penalized logistic regression aims to address specific issues such as imbalanced
classes or feature selection.
- Examples include Firth's penalized logistic regression and Elastic Net logistic
regression.
6. Bayesian Logistic Regression:
- Incorporates Bayesian inference principles into logistic regression.
- It models the posterior probability distribution over the parameters using prior
distributions and likelihoods.
- Bayesian logistic regression allows for uncertainty estimation and flexible modeling.
