
DEVELOPMENT OF A BREAST CANCER DIAGNOSIS SYSTEM

USING CONVOLUTION NEURAL NETWORKS

BY

UKOHA, CHINONSO PRECIOUS


(17CG023225)

A PROJECT SUBMITTED TO THE DEPARTMENT OF COMPUTER AND


INFORMATION SCIENCES, COLLEGE OF SCIENCE AND TECHNOLOGY,
COVENANT UNIVERSITY OTA, OGUN STATE.

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF


THE BACHELOR OF SCIENCE (HONOURS) DEGREE IN COMPUTER
SCIENCE

JULY 2021
CERTIFICATION
I hereby certify that this project was carried out by Ukoha Chinonso Precious in the
Department of Computer and Information Sciences, College of Science and Technology,
Covenant University, Ogun State, Nigeria, under my supervision.

1. Name: Dr Olanma Iheanetu


(Supervisor)

Signature ___________________________ Date ____________________

2. Name: Dr. Olufunke O. Oladipupo


(Head of Department)

Signature ____________________________ Date ____________________

PAGE ii
DEDICATION
This report is dedicated to my parents, Dr & Mrs. Ukoha for their ever-present support,
encouragement, and advice.

Most importantly, to the Almighty God, for the wisdom, guidance, understanding, and
favor all through my schooling experience. It was your grace from the start to finish.

ACKNOWLEDGEMENT

My sincere appreciation goes to God, for His grace and favor, without which this study
would have been impossible.
I thank my parents, Dr and Mrs. Ukoha, for their unfailing support and encouragement during
my academic years and the completion of this study, and my siblings, for always keeping
me on track whenever I faced challenges.
A special thanks to my final year project supervisor, Dr Olanma Iheanetu, who made this
work a reality. Her advice was invaluable, and she led me through all phases of my project
with her enlightening suggestions, support, and counsel. I would also like to express my
gratitude to my final year defense panelists for making my experience a pleasurable one
and for their insightful remarks and recommendations. To the HOD and the entire Department
of Computer and Information Sciences, thank you for the tireless hours spent imparting
priceless knowledge to my life. God bless you all.

TABLE OF CONTENTS
Title Page
Certification

Dedication

Acknowledgement

Table of contents

List of figures

Abstract

CHAPTER ONE: INTRODUCTION


1.1. Background Information

1.2. Statement of Problem

1.3. Aim and Objective of Study

1.4. Significance of the Study

1.5. Methodology

1.6. Limitations of the Study

1.7. Scope of the Study

1.8. Arrangement of Thesis

CHAPTER TWO: LITERATURE REVIEW


2.1. Introduction

2.1.1. Breast Cancer Risk Factors

2.1.2. Machine Learning

2.1.3. Machine Learning in Breast Cancer Diagnosis

2.1.4. Machine Learning in Breast Cancer Risk Prediction

2.2. Review of Existing Systems

2.2.1. Design and Implementation of a Fuzzy Expert System for Diagnosing Breast Cancer

2.2.2. Cedars Sinai Breast Health Assessment

2.3. Review of Existing Principles/Methods

2.3.1. Algorithm for the Detection of Breast Cancer in Digital Mammograms Using Deep Learning

2.3.2. Machine Learning Classification Techniques for Breast Cancer Diagnosis

2.3.3. Predicting Breast Cancer Risk Using Personal Health Data and Machine Learning Models

2.3.4. Breast Cancer Histopathological Image Classification Using Convolutional Neural Networks

2.3.5. Predicting Breast Cancer via Supervised Machine Learning Methods on Class Imbalanced Data

2.3.6. Breast Cancer and Prostate Cancer Detection Using Classification Algorithms

2.3.7. Development of a Breast Cancer Risk Assessment Model Using a Machine Learning Approach

2.4. Review of Related Findings

CHAPTER THREE: SYSTEM ANALYSIS AND DESIGN


3.1. Introduction

3.2. System Analysis

3.2.1. Analysis of Existing Systems

3.2.2. The Proposed System

3.3. Requirements Analysis

3.3.1. User Requirements

3.3.2. Functional Requirements

3.3.3. Non-Functional Requirements

3.4. System Architecture

3.4.1. Convolutional Neural Networks (CNNs) Architecture

3.4.2. Supervised Learning Architecture

3.4.3. Web Framework Architecture

3.5. System Design

3.5.1. Class Diagram

3.5.2. Sequence Diagram

3.5.3. Use Case Diagram

3.5.4. Activity Diagram

3.6. Description of Tables

3.6.1. User Creation Table

3.6.2. Patient Diagnosis Data Table

3.6.3. Patient Risk Assessment Data Table

CHAPTER FOUR: SYSTEM IMPLEMENTATION


4.1. Introduction

4.2. System Requirements

4.2.1. Hardware Requirements

4.2.2. Software Requirements

4.3. Implementation Process

4.3.1. Implementation Tools Used

4.3.2. Diagnosis Model Training Process

4.3.3. Risk Assessment Model Training Process

4.3.4. System Evaluation

4.4. Program Modules and Interfaces

4.4.1. Landing Page Module

4.4.2. Signup Page Module

4.4.3. Login Page Module

4.4.4. Home Page Module

4.4.5. Diagnosis Page Module

4.4.6. Risk Assessment Page

CHAPTER FIVE: SUMMARY, RECOMMENDATION, AND CONCLUSION

5.1. Introduction

5.2. Summary

5.3. Recommendation

5.4. Conclusion

REFERENCES

LIST OF FIGURES
Title
Figure 2.1: Obi-Hep Breast Cancer Diagnosis System
Figure 3.1: Convolutional Neural Network Architecture
Figure 3.2: MobileNet Architecture
Figure 3.3: Supervised Learning Architecture
Figure 3.4: Class Diagram of the Breast Cancer Diagnosis System
Figure 3.5: Sequence Diagram of the Breast Cancer Diagnosis System
Figure 3.6: Use Case Diagram of the Breast Cancer Diagnosis System
Figure 3.7: Activity Diagram of the Breast Cancer Diagnosis System
Figure 4.1: Importing Necessary Libraries
Figure 4.2: Loading and preprocessing the image data
Figure 4.3: Visualising preprocessed images
Figure 4.4: Building the model and viewing the summary
Figure 4.5: Compiling the model
Figure 4.6: Training the diagnosis model
Figure 4.7: Accuracy and loss graphs against training and validation data
Figure 4.8: Confusion matrix of the Breast Cancer Diagnosis model on the test data
Figure 4.9: Saving the diagnosis model
Figure 4.10: Embedding the diagnosis model
Figure 4.11: Checking for null values
Figure 4.12: Viewing the basic statistics of the data
Figure 4.13: Distribution of the dataset columns
Figure 4.14: Importance of the target data to the dataset
Figure 4.15: Distribution of each age group in the dataset
Figure 4.16: Correlation matrix of the dataset
Figure 4.17: Splitting the dataset into train and test
Figure 4.18: Balancing the classes of the dataset
Figure 4.19: Classification metrics of the risk assessment model
Figure 4.20: Pickling the Risk Assessment Model
Figure 4.21: Unpickling the Risk Assessment model
Figure 4.22: System Evaluation Metrics
Figure 4.23: Landing Page of the Breast Cancer Diagnosis System
Figure 4.24: Landing Page of the Breast Cancer Diagnosis System
Figure 4.25: Signup Page of the Breast Cancer Diagnosis System
Figure 4.26: Login Page of the Breast Cancer Diagnosis System
Figure 4.27: Home Page of the Breast Cancer Diagnosis System
Figure 4.28: Perform Diagnosis Form
Figure 4.29: View Diagnosis Results Page
Figure 4.30: Diagnosis Results History
Figure 4.31: Risk Assessment Result Form
Figure 4.32: Risk Assessment Form
Figure 4.33: Risk Assessment Result
Figure 4.34: Risk Assessment History

LIST OF TABLES
Title
Table 3.1: User Creation Table
Table 3.2: Table of the Patient Diagnosis Data
Table 3.3: Table of the Patient Risk Assessment Data
Table 4.1: Hardware Requirements
Table 4.2: Software Requirements

ABSTRACT

Breast cancer and other noncommunicable diseases have been a great cause of concern in
Nigeria and other Low- and Middle-Income Countries (LMICs). The inadequacy of skilled
personnel and infrastructure has led to a high rate of breast cancer misdiagnosis and
mortality. The insufficient amount of indigenous data in the nation has further aggravated
the situation. This study aims to automate the process of breast cancer diagnosis by
developing a web-based Breast Cancer Diagnosis System using Convolutional Neural
Networks (CNNs) that takes patients' risk factors into account to support breast cancer
prevention and also serves as a standardized repository for patient data. To carry out
this study, an extensive review of existing systems and journals was carried out to
discover which machine learning algorithms have previously diagnosed breast cancer and
performed breast cancer risk assessment efficiently. Using the TensorFlow Python library
and JupyterLab, a MobileNet CNN model was trained to diagnose breast cancer from biopsy
images using the Breast Cancer Histopathological Image Classification (BreakHis) dataset,
and a Logistic Regression model was trained to classify breast cancer risk as high or low
using the Breast Cancer Surveillance Consortium (BCSC) Risk Estimation dataset. A
web-based application was built using HTML, CSS and Bootstrap for the front end and the
Django web framework for the server side, with MySQL as the database, and the models were
embedded into the system. It was found that the MobileNet CNN architecture can efficiently
diagnose breast cancer, with a training accuracy of 93.8%. On further evaluation with 9
data samples from Clinix hospital, the system predicted 7 samples correctly, achieving an
accuracy of 77.8%. The Logistic Regression classifier predicted breast cancer risk with an
accuracy of 99%. This system can serve as a second opinion to pathologists or as a
substitute for pathologists in cases where they are unavailable.

CHAPTER ONE

INTRODUCTION

1.1. BACKGROUND INFORMATION


The World Health Organization's (WHO) Constitution, which came into force on the 7th of
April, 1948, described health as "a state of complete physical, mental, and social
well-being." The organization maintained that health is a condition determined by the
occurrence or absence of disease, and so added to the definition that a person must
be free of disease in order to qualify as healthy (W.H.O, 2008).

Disease can be defined as any adverse deviation from an organism's natural state. It
differs from normal bodily damage in that it is generally accompanied by clinical
symptoms. Diseases may have their origins within the individual, they may be the
product of a medical procedure, or they may be triggered by a foreign agent such as a
toxic chemical. In the latter situation, the illness is noncommunicable, meaning that it
only affects the individual that is exposed to it. The World Health Organization (WHO)
has recognized four major types of Noncommunicable Diseases (NCDs): cardiovascular
disease, chronic respiratory disease, diabetes mellitus and cancer (Burrows, W. and
Scarpelli, 2020). Cancer is responsible for a large share of NCD deaths in the world
(WHO, 2021b).

Cancer is one of the leading causes of NCD mortality globally, responsible for almost
ten million deaths each year, or nearly one in every six deaths worldwide (Ferlay et
al., 2019). The most commonly diagnosed cancers globally in 2020 were breast, lung, colon
and rectum, prostate, non-melanoma skin, and stomach cancers. The leading causes of global
cancer mortality in 2020 were lung cancer, colon and rectum cancer, liver cancer, stomach
cancer and breast cancer. About 70 percent of cancer deaths occur in low- and
middle-income countries (WHO, 2021a).

In Nigeria, cancer is responsible for about 70,000 deaths annually (28,414 among males
and 41,913 among females). Breast, cervix uteri, prostate, non-Hodgkin lymphoma, and liver
cancers are the five cancers with the highest reported incidence in the nation. Breast,
cervix uteri, prostate, liver, and non-Hodgkin lymphoma cancers have the highest
projected death rates. Breast cancer is now Nigeria's most diagnosed cancer (Ferlay et
al., 2019).

Breast cancer occurs when some cells in the breast start to grow abnormally. The affected
cells divide more rapidly than unaffected cells, forming a lump, and keep growing until
they eventually spread to the lymph nodes and subsequently the rest of the body (Mayo
Clinic, 2021). Breast cancer affects both genders, but women are more susceptible to the
disease. This is because male breast tissue is mostly fat and fibrous tissue called
stroma, with fewer ducts and lobules than female breast tissue. Also, women have higher
estrogen levels than men, which can increase their risk of breast cancer (CCTA, 2019).

The number of women at risk of breast cancer in Nigeria has grown steadily, from about
24.5 million in 1990 to nearly 40 million in 2010, and was expected to rise above 50
million in 2020 (Olatunji et al., 2019). Nigeria has one of the highest age-standardised
incidence rates (ASR) of breast cancer in Sub-Saharan Africa, second only to South Africa,
although lower than the rates in Europe and North America. In 2010, the country's ASR for
breast cancer was 54.3 per 100,000 people. Although much lower than Belgium's rate of
111.9 per 100,000 women and the United States' rate of 92.9 per 100,000 women, it
reflected a 100 percent rise in breast cancer incidence in the country over the preceding
ten years. Nigeria also has the world's third highest breast cancer death rate (25.9 per
100,000 women), resulting in the deaths of half the women affected (Emilia, 2017).

Breast cancer-related mortality rates keep increasing in Nigeria because the public
and health-care providers lack understanding of the importance of early detection of
breast cancer, and as a result, late-stage diagnosis remains the norm. This is bad for
patients’ prognosis, as the stage at breast cancer diagnosis is among the most important
prognostic factors (Emilia, 2017). There is also a significant shortage of healthcare
professionals in Nigeria. With at least 3,000 doctors graduating per year, the Nigeria
Medical Association (NMA) estimates that the country will need about twenty-five (25)
years to produce enough doctors to meet its needs (Agency Report, 2019).

This situation is even worse in the field of pathology. Pathologists are very important
to breast cancer diagnosis because they examine breast tissue obtained by biopsy, which
is the surest method of determining whether cells are cancerous, to derive important
breast cancer values. According to the College of Nigerian Pathologists, the total number
of pathologists in the country is only about 500. This puts the ratio of pathologists to
the entire population at a concerning 1:400,000, far short of the recommended standard of
one pathologist to about 40,000 people. At the current rate, Nigeria will take about 500
years to achieve the pathologist-to-patient ratio that exists in the United States and the
United Kingdom today. This is detrimental to the healthcare system because pathology is
used to render evidence-based diagnoses of 70 to 80 percent of diseases (College of
Nigerian Pathologists, 2020).
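
The ratios quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the population figure is not stated in the source and is instead implied by multiplying the reported number of pathologists by the reported ratio, and all variable names are hypothetical.

```python
# Illustrative arithmetic only; the inputs are the figures quoted in the text.
pathologists = 500                 # reported number of pathologists in Nigeria
people_per_pathologist = 400_000   # reported ratio of 1:400,000
recommended_ratio = 40_000         # recommended standard of 1:40,000

# Population implied by the two reported figures (an inference, not a cited value).
implied_population = pathologists * people_per_pathologist

# Pathologists needed to meet the recommended standard for that population.
required_pathologists = implied_population // recommended_ratio

# How many times larger the pathology workforce would need to be.
shortfall_factor = required_pathologists / pathologists

print(implied_population, required_pathologists, shortfall_factor)
```

Under these assumptions, meeting the recommended standard would require roughly ten times the current number of pathologists.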

The lack of pathologists often leads to misdiagnosis, and according to the Care
Organization Public Enlightenment (COPE), 70 percent of cancer patients are
misdiagnosed in Nigeria (Chioma, 2020). The most common reasons for long intervals before
breast cancer tumor staging in Nigeria are symptom misinformation and misdiagnosis
(Agodirin et al., 2019). In the case of Mary Abia, a misdiagnosis of her biopsy reports
led to her ultimately progressing to stage IV cancer, which is the deadliest cancer stage
with the worst prognosis (Ake, 2018). There is also a severe lack of awareness of breast
cancer risk factors and issues among Nigerian women, with most women attributing their
symptoms to spiritual attacks (Agatha Ogunkorode et al., 2021).

A number of risk factors affect an individual’s chances of developing breast cancer.
While all cancers are caused by several mutations, these mutations are caused by
environmental interaction. Studies show that the majority of cancers are not inherited
and that environmental influences such as eating patterns, obesity, alcohol intake, and
infections have a substantial effect on their development. The fact that cancer has a
relatively small genetic component and that environmental causes can be changed suggests
that cancer can be avoided (Anand et al., 2008). Breast cancer risk assessment is done by
taking risk factors into account and evaluating them to determine an individual's level
of susceptibility. It is important to perform proper risk assessment in order to carry
out preventive measures, prolong and improve quality of life, and foster early detection
and diagnosis of breast cancer (Akinnuwesi et al., 2020).

Research into how to improve quality of life and the health care system as a whole has
been hindered by the country’s lack of standardized data. According to Nigeria’s
minister of health, Isaac Adewole, Nigeria struggles with high-burden diseases and must
encourage data collection to foster research on how to combat those diseases. He also
emphasized that the lack of data collection is restricting the health sector's ability
to implement practical and evidence-based control programmes for noncommunicable
diseases in the country. The commissioner of health in Lagos state, Jide Idris, also
described the need to use Information and Communication Technology to help the healthcare
system in Nigeria minimize the obstacles caused by inadequate data. It is evident that a
proper and reliable data collection system will be key to overcoming the challenges
brought on by the high growth rate of breast cancer and diagnosis issues (Anthonia
Obokoh, 2019).

Diagnosis of breast cancer can be performed by breast examination, mammogram, breast
ultrasound, and biopsy, which involves removing a sample of breast cells for testing. The
interval between when a woman experiences her first breast cancer symptoms and when
she is diagnosed may have a huge impact on the stage at which she is diagnosed and her
chances of recovery. Longer intervals between a breast cancer diagnosis and the start of
treatment are associated with a worse prognosis, since these delays can result in stage
progression and breast cancer mortality (Emilia, 2017).

The fundamental factors that determine breast cancer survivability are proper risk
assessment and early, accurate diagnosis. It is important for breast cancer to be
detected and risk to be assessed as early as possible, because this allows patients to
begin receiving the necessary treatment sooner and so plays a significant role in
reducing the mortality rate (Akinnuwesi et al., 2020). Hence, there is a dire need to
employ strategies to manage maldistribution and other human-related factors in the
healthcare industry. One of the most efficient strategies is to employ artificial
intelligence and machine learning techniques.

Machine learning techniques have been applied to the classification of breast images and
biopsy records in the form of Computer-Aided Detection (CAD) technologies, which aid
doctors in reading and interpreting medical images and reports. The aim of CAD programs
is to improve sensitivity and precision so that doctors can make better diagnoses
(Akinnuwesi et al., 2020). Machine learning techniques can also be applied to breast
cancer risk assessment to determine the effective risk factors and their associations.
The models built from these techniques can be used to predict and estimate an
individual’s susceptibility to breast cancer. Efficient risk assessment can enhance the
success of treatment, improve survivability chances and minimize expenses. Hence it is
necessary to build dependable and efficient machine learning models to perform proper
breast cancer risk assessment (Al-Quraishi et al., 2017).

1.2. STATEMENT OF PROBLEM


It has been established that early diagnosis of breast cancer is very important for
improving patients' prognosis and reducing the overall mortality rate. However, in
Nigeria, breast cancer is typically diagnosed at an advanced stage (Emilia, 2017). This
is due to a number of factors, such as:
• The pathologist-to-patient ratio in Nigeria is 1:400,000 (College of Nigerian
Pathologists, 2020).
• About 70% of cancer patients in Nigeria are misdiagnosed due to human error
(Chioma, 2020).
• Mismanagement of symptoms and risk factors leads to a worse prognosis of breast cancer
(Agodirin et al., 2019).
• The lack of standardized datasets in Nigeria is restricting research interventions to
combat noncommunicable diseases such as breast cancer (Anthonia Obokoh, 2019).

1.3. AIM AND OBJECTIVE OF STUDY


This project aims to develop a web-based Breast Cancer Diagnosis System using
Convolutional Neural Networks (CNNs) that takes patients' risk factors into account to
support breast cancer prevention and also serves as a standardized repository for
patient data.
To achieve this aim, the following objectives have to be met:
• To review existing systems and literature on automated breast cancer diagnosis
systems and algorithms
• To design a model of a web-based automated breast cancer diagnosis system
• To train two models to effectively perform breast cancer diagnosis and risk
assessment, and to build the web-based automated breast cancer diagnosis system
• To test and evaluate the system.

1.4. SIGNIFICANCE OF THE STUDY


• The system will act as a second opinion or substitute for a pathologist
• The system will aid the prevention of breast cancer
• The system will help create a database of indigenous breast cancer data

1.5. METHODOLOGY
The techniques and tools used to achieve the objectives of the study include:

• Objective 1: To achieve this objective, an extensive literature review will be carried
out to gather knowledge about the current breast cancer diagnosis process and the
applications of machine learning to breast cancer risk assessment and diagnosis. The
review will identify the limitations of the existing systems and the most efficient
machine learning algorithms for addressing the problem statement.

• Objective 2: The second objective is to design/model the breast cancer management
system. The design activities include architecture design, interface design and
component design. Unified Modelling Language (UML) diagrams will be used in
architectural and component design to describe the general configuration of the system,
the key components and their relationships. The Figma design tool will be used to design
the user interfaces.

• Objective 3: Model 1 will be trained on histopathology image data to diagnose breast
cancer using CNNs. Model 2 will be trained on breast cancer risk factor data to predict
risk level as low or high using a Logistic Regression classifier. The system will be
built using HTML5, CSS3 and the Bootstrap CSS framework for the front end, while Django
will be used for the back end.

• Objective 4: This objective will be achieved by carrying out component testing,
system testing and acceptance testing, and by evaluating the diagnosis model's
performance on previously unseen data.
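
As a rough illustration of the second model described under Objective 3, the sketch below fits a logistic regression classifier by gradient descent and labels each record as high or low risk. It is a minimal, standard-library sketch and not the implementation used in this study: the two features (a scaled age and a family-history flag), the toy records, the learning rate, the number of epochs and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # derivative of the log-loss with respect to the score
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_risk(x, weights, bias, threshold=0.5):
    """Return 'high' when the predicted probability reaches the threshold."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return "high" if p >= threshold else "low"


# Hypothetical toy records: [age scaled to 0-1, first-degree family history (0 or 1)].
toy_features = [
    [0.2, 0], [0.3, 0], [0.4, 0], [0.5, 0],
    [0.6, 1], [0.7, 1], [0.8, 1], [0.9, 1],
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = low risk, 1 = high risk

weights, bias = train_logistic_regression(toy_features, toy_labels)
predictions = [predict_risk(x, weights, bias) for x in toy_features]
print(predictions)
```

In practice a library implementation and a held-out test set would be used; the point here is only that the classifier reduces a weighted combination of risk factors to a probability, which is then thresholded into the two risk classes.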

1.6. LIMITATIONS OF THE STUDY


The limitations of the study include:
• The diagnosis dataset was not annotated by a pathologist.
• The risk assessment dataset was highly unbalanced.
• The diagnosis of breast cancer may be restricted to one diagnostic method.

1.7. SCOPE OF THE STUDY

The breast cancer risk assessment tool will only classify breast cancer risk as high or
low and will not give 5-year or lifetime risk predictions. The diagnosis data is
limited to women only.

1.8. ARRANGEMENT OF THESIS


Chapter One of the project contains an introduction to the project, statement of problem,
aim, objectives, the significance of study, methodologies used and the project outline.
Chapter Two includes a synopsis of the literature, an assessment of current systems, and a
collection of publications, papers, and studies relevant to artificial intelligence and breast
cancer.
Chapter Three presents the web application system and classification algorithms design,
which includes the physical and logic design of the system; it presents the system
architecture and conceptual design.
Chapter Four contains the system implementation, the tools used, the development
methodology, program modules, interfaces and system development process.
Chapter Five includes a basic summary of the project, as well as findings from previous
studies in relation to the new system. It also contains the outcome and importance of the
study, as well as recommendations for the research community or system user.

CHAPTER TWO

LITERATURE REVIEW

2.1. INTRODUCTION
Cancer is typically named for the part of the body where it first appeared; breast
cancer therefore refers to the accelerated, out-of-control growth of breast tissue
(Sharma et al., 2010). Female breasts are made up of a variety of tissues: the glandular
tissue (lobules) that produces milk, the fatty tissue that determines breast size, and
the fibrous tissue that holds the glandular and fatty tissue in place. The muscles that
attach the breasts to the ribs are not part of the breast anatomy (Cleveland Clinic,
2020).

Breast cancer may originate from a range of sites in the breast. Ductal cancers arise from
the ducts that transport milk to the nipple, while lobular cancers arise from the glands
that produce breast milk. Other forms of breast cancer, such as phyllodes tumor and
angiosarcoma, are less prevalent. A small percentage of breast cancers originate from
other tissues (American Cancer Society, 2019). Tumors can grow in different areas of the
breast. Most breast tumors result from benign (non-cancerous) changes. An example is
fibrocystic change, a non-cancerous condition in which women experience cysts, fibrosis,
lumpiness, and regions of swelling, soreness, or breast pain (Sharma et al., 2010).

Breast cancer can be divided into two types: non-invasive and invasive. Non-invasive
breast cancers, also known as carcinoma in situ or pre-cancers, remain located within the
breast's milk ducts or lobules and do not infiltrate the normal tissues inside or outside
the breast. Invasive breast cancers, on the other hand, break through the duct and
lobular walls and infiltrate the neighboring fatty and connective tissues. Most breast
cancer types are invasive. The cancer type and the patient's response to therapy depend
on breast cancer metastasis, i.e., how far the cancer has spread. Breast cancer can
generally be classified as one of the following:
• Ductal Carcinoma In Situ
• Lobular Carcinoma In Situ
• Invasive Ductal Carcinoma
• Invasive Lobular Carcinoma
• Inflammatory Breast Cancer
• Male Breast Cancer
• Paget's Disease of the Nipple
• Phyllodes Tumors of the Breast
• Metastatic Breast Cancer.

2.1.1. Breast Cancer Risk Factors


Any factor that increases a person's chances of developing a disease, such as cancer, is
referred to as a risk factor. Some women may have one or several breast cancer risk
factors and never get the disease, while others may develop breast cancer without
any known risk factors. Even when a woman with risk factors develops breast cancer, it is
difficult to tell what role those risk factors played (American Cancer Society, 2020).

Many people are predisposed to breast cancer due to unavoidable risk factors such as
gender and age. Other risk factors, such as a person's family history, also cannot be
changed. However, some risk factors, such as alcohol consumption and body weight, can be
altered (University of California San Francisco, 2020). The factors that raise the risk
of having breast cancer are listed below, including both those that cannot be changed and
those that can (American Cancer Society, 2020).

The following are risk factors that cannot be changed:

• The most important breast cancer risk factor is gender. Men are susceptible to
breast cancer, but it is about 100 times more prevalent in women, due to the presence of
the hormone estrogen, which is responsible for irregular cell growth, in women's
breasts.

• Age is also a major breast cancer risk factor. Breast cancer is more likely to
develop as a person gets older.

• Race also plays an important role in breast cancer risk. White women are more likely
than Black, Latino, and Asian women to develop breast cancer, but Black women tend to
develop late-stage breast cancer with a poor prognosis, are diagnosed at an earlier age,
and have a higher mortality rate.

• Personal history is also a factor, as a woman's chance of getting cancer in the
other breast increases if she has had cancer in one breast.

• Family history is also a factor, as women who have a first-degree family member
who previously had or currently has breast cancer are more susceptible to the
disease.

• A woman’s age at menarche is also a risk factor. Women who start menstruating
before the age of 12 are thought to be at a marginally greater risk. This is because
the level of female sex hormones a woman is exposed to over her lifespan is also a
risk factor: the greater her exposure, the greater her chance of getting breast cancer.

• A woman's age at her first birth is another risk factor. Women who have their first
child after about age 30, or who never give birth, are slightly more susceptible to
breast cancer than women who give birth at a younger age.

• The age at which a woman reaches menopause is also a risk factor. Women who reach
menopause at 54 or younger have a lower risk than women who reach menopause after 54.
This increased risk may be linked to their elevated lifetime exposure to sex hormones.

• Women with higher breast density are more susceptible to the disease, and their
breast cancer tumors are more difficult to detect.

The following are risk factors that can be altered:

• A woman who is overweight or obese after menopause is more likely to develop breast
cancer. The increased estrogen levels associated with excess body fat make these women
more susceptible to breast cancer. In addition, women who are overweight have an
increased level of insulin, which has been linked to breast cancer tumors.

• Physical exercise has been linked to reduced breast cancer risk in postmenopausal and
premenopausal women. Increased levels of physical exercise, especially in postmenopausal
women, reduce breast cancer susceptibility.

• Consumption of alcoholic drinks increases the risk of HER2-positive breast
cancer. Alcohol increases the level of the hormone estrogen, which is related to
HER2-positive breast cancer. Alcohol also damages the DNA in cells, which can in turn
increase breast cancer susceptibility.

• Women who use Hormone Replacement Therapy (HRT) for a long time have a
greater chance of developing breast cancer than women who do not use it. If women stop
their HRT intake, their risk of breast cancer decreases, but some additional risk
persists for more than ten years.

Every woman is at some level of risk for breast cancer, but the risk level varies greatly
from one woman to the next. Understanding risk is crucial because it affects medical
choices ranging from whether a symptom-free woman should get a mammogram to how
actively to pursue preventive measures such as anti-estrogens or prophylactic mastectomy
and ovary removal.
1.9.2. Machine Learning
Machine learning is a field of computer science that grew out of the study of pattern
recognition and intelligence. It is the practice of designing algorithms that can learn
from experience and predict outputs from data sets. Instead of following a static, linear
set of instructions, these algorithms use input variables to make guided decisions or
judgements (Simon et al., 2020).

Broadly speaking, there are three main machine learning methods. These include:
 Supervised learning: Supervised machine learning algorithms require labelled
data as guidance. The input dataset is split into training and testing datasets.
The training dataset includes an output attribute that is to be predicted or
classified. The algorithms learn patterns from the training set and apply them
to the test set for prediction or classification. Examples include the decision
tree, naïve Bayes, and support vector machine algorithms.

 Unsupervised learning: Unsupervised learning algorithms learn features from the
data they are given, without labels. They use the previously learned features to
identify the group that new data belongs to. They are mostly used for feature
reduction and clustering. Examples include k-Means clustering and principal
component analysis.

 Semi-supervised learning: Semi-supervised learning algorithms combine the
benefits of supervised and unsupervised learning techniques. They are useful
when most of the available data is unlabeled and obtaining labelled data is a
time-consuming task.
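
As a minimal sketch of the supervised workflow described above (learn from a labelled training set, then apply what was learned to a test set), the example below uses a nearest-centroid classifier. The classifier, the feature values, and the function names are illustrative assumptions and do not come from any of the cited studies.

```python
# Toy illustration of supervised learning with a nearest-centroid classifier.
# All data values and names are hypothetical.

def fit_centroids(features, labels):
    """Learn one mean feature vector (centroid) per class from labelled data."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

# Hypothetical training set: two features per sample plus the output attribute.
train_x = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["benign", "benign", "malignant", "malignant"]

# Hypothetical test set: the learned centroids are applied to unseen samples.
test_x = [[0.9, 1.1], [3.1, 3.0]]

centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, x) for x in test_x]
print(predictions)
```

An unsupervised method such as k-Means would receive only the feature vectors, without the labels in `train_y`.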

Machine learning has been used in various areas of medicine such as cardiology, cancer,
and diabetes. Most machine learning models built in the laboratory are used for prognosis,
diagnosis, or classification of clinical groups, which shows that they may be useful in the
creation of automated decision support tools. Big databases and precise labels, which are
usually given by expert practitioners, are important for developing these methods. The
premise is to find the data models or factors that are related to the desired result (e.g.,
the cancer diagnosis of a woman). This helps to extract important insights from the
given data, so that patients can keep track of their overall wellbeing and healthcare
practitioners can make appropriate decisions on treatment and management
(Triantafyllidis & Tsanas, 2019).

1.9.3. Machine Learning In Breast Cancer Diagnosis


Machine learning has two components: feature extraction and classification. The former
derives essential features, while the latter differentiates the provided data based on the
derived features. A major drawback of machine learning is that it can derive only a
restricted range of characteristics from images when processing them, and it cannot
derive distinguishing characteristics from the training data. This drawback is
resolved by using deep learning.

Deep learning is a field of machine learning that learns through its own layered
algorithmic structure. It is modelled on how human beings make judgments and
systematically organises knowledge into a uniform framework. This is accomplished by
integrating many algorithms into a multilayer network known as an artificial neural
network (ANN), which was created to model the way the human brain assimilates
information using the biological neural network. Deep learning is therefore said to be
more efficient than baseline machine learning models (Krishna et al., 2018). To perform
breast image classification, specialized natural image classification technology and
machine learning techniques are primarily used. This gives medical practitioners an
assistive tool and saves them time (Nahid & Kong, 2017).

1.9.4. Machine Learning In Breast Cancer Risk Prediction


Predictive analysis aims to forecast future patterns and results. Machine learning methods
and regression methods are the two types of methods used to perform predictive analytics.
Owing to their success in managing large-scale datasets with standardized features and
noisy data, machine learning methods are now more prevalent in predictive analysis.
Health care, data protection, education, credit card fraud prevention, social media, cloud
storage, software measurement, consistency and fault prediction, expense and
commitment calculation, and software reuse have all benefited from innovative
predictive models (MacIntyre, 2014).

The contribution of machine learning to breast cancer risk assessment has had a positive
impact on breast cancer treatment. It has the ability to decrease breast cancer deaths by
aiding the early identification of the disease. It helps ensure that the risk of developing
breast cancer is correctly measured and detected. This encourages the early detection of
women at high risk of getting the disease so that the necessary preventive measures can
be taken, and it helps to prevent breast cancer misdiagnosis. It also has the ability to
bring down the cost of breast cancer treatment for health care providers and patients
(Akinnuwesi et al., 2020). Since machine learning algorithms are not limited to a fixed set
of risk factors, they can modify or integrate new ones (Krishna et al., 2018) and offer the
promising prospect of better and more accurate risk estimates (Ming et al., 2019).

1.10. REVIEW OF EXISTING SYSTEMS


This section gives an overview of existing systems related to the methods used in this
study.
1.10.1. Design and Implementation of a Fuzzy Expert System for Diagnosing Breast
Cancer
In this work, a fuzzy expert system for breast cancer diagnosis and treatment advice was
proposed, which provides data on cancer type and clinical recommendations for doctors
and patients. The data used in the study were based on information obtained from
interviews and internet sources. The Java programming language, MATLAB, and the
SQLite database engine were used to build the system. The system can be used by
either an administrator or a practitioner. To prevent manipulation of the rules or
records, access management is required to control the type of user who may operate
the system. The system's usage can be controlled and monitored by an administrator
who is in charge of the application's access information (Okikiola et al., 2016).

The limitations of this study are that it diagnoses from symptoms using an expert system,
which will not be as accurate as using Convolutional Neural Networks (CNNs) to
diagnose from histopathology images, and that it does not contain a risk assessment tool.

Figure 2.1: Obi-Hep Breast Cancer Diagnosis System (Okikiola et al., 2016)

1.10.2. Cedars Sinai Breast Health Assessment

The Cedars-Sinai Breast Health Assessment is an online tool for performing breast cancer
risk assessment. The system takes in a range of values, including breast cancer risk
factors such as age, age at first menstruation, age at menopause, and weight, and gives a
risk value based on these features. The limitation of this tool is that its prediction model
was trained predominantly on data from white women and as such may not predict
accurately for black women.

1.11. REVIEW OF EXISTING PRINCIPLES/METHODS


1.11.1. Algorithm for the Detection of Breast Cancer in Digital Mammograms Using
Deep Learning
The aim of this research was to use deep learning to build a method for detecting breast
cancer in digital mammograms. The database used to develop this method is the Health
Cooperative for "The Digital Mammography DREAM challenge," which is free to the
public. It is made up of 500 mammogram images in DICOM format, ranging in size
from 3328x2560 to 5928x4728 pixels. It also contains annotation files that help
differentiate between normal and cancer cases.

The images were preprocessed by eliminating the label anomaly that is found in all
images in the archive, equalizing the intensity levels in the images, and decreasing the
image quality and contrast. A convolutional neural network architecture was then built
with the rectified linear unit (ReLU) as the nonlinear layer. It consisted of a
convolutional layer with thirty filters and a 5x5 kernel size, three 2x2 pooling layers
with a stride of 2 (which keep the most significant values while reducing the feature map
size), and four fully connected layers.
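
The layer sizes quoted above can be sanity-checked with simple arithmetic. The sketch below is not the authors' code. It assumes a single-channel (grayscale) input and a hypothetical 256-pixel input side, neither of which is stated in the source, and it only counts the parameters of the convolutional layer and the shrinkage caused by the three pooling layers.

```python
def conv_params(num_filters, kernel_h, kernel_w, in_channels):
    """Trainable parameters of a conv layer: kernel weights plus one bias per filter."""
    return num_filters * (kernel_h * kernel_w * in_channels + 1)

def pooled_size(size, pool=2, stride=2, layers=1):
    """Side length after `layers` pooling layers with no padding."""
    for _ in range(layers):
        size = (size - pool) // stride + 1
    return size

# Thirty 5x5 filters; in_channels=1 is an assumption (grayscale mammogram).
params = conv_params(30, 5, 5, in_channels=1)

# Three 2x2 pooling layers with stride 2; 256 is an assumed input side length.
side = pooled_size(256, pool=2, stride=2, layers=3)

print(params, side)
```

Each 2x2 pooling layer with stride 2 halves the side length, so three of them reduce it by a factor of eight.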

The contribution of this method was the preprocessing of mammogram images with the
contourlet transform and the proposal of a new neural network topology suited to the
task of breast cancer detection. The limitations of this study are the use of a limited
dataset and the use of mammography images for training the model, which are ineffective
in identifying breast cancer in dense breasts. Future work involves testing the proposed
algorithm with a larger database to avoid possible overfitting (Pirouzbakht & Mejía, 2017).
1.11.2. Machine Learning Classification Techniques for Breast Cancer Diagnosis
The focus of this research is to introduce an automated approach to diagnosing breast
cancer that follows a logical workflow. The Wisconsin Diagnostic Breast Cancer (WDBC)
dataset, contributed on the 1st of November, 1995, was used in this research. It contains
569 instances, 357 of which are non-cancerous and 212 of which are cancerous. It has 32
attributes: an ID number, a class attribute with two labels (B = benign, M = malignant),
and 30 real-valued attributes. These attributes were derived from digitised images of a
fine needle aspiration procedure performed on a breast tumor and describe the
properties of the cell nuclei in the image.

Data selection and preprocessing were performed by data cleaning, data partitioning
into testing and validation data, feature selection by the Recursive Feature Elimination
(RFE) and Correlation-based Feature Selection (CFS) methods, and feature extraction by
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). The
machine learning techniques used for classification on the processed data were Support
Vector Machine (SVM), Naïve Bayes Classifier (NBC), and Artificial Neural Networks
(ANN).

The SVM-LDA and ANN-LDA models outperformed the other classification models in
simulations. SVM-LDA, however, was more efficient than ANN-LDA, as ANN-LDA took
longer to compute. As a result, the paper suggested an intelligent method for
breast cancer diagnosis that combines linear discriminant analysis and a support vector
machine (with Radial Basis Function kernel). The classification metrics achieved in the
study were an accuracy of 98.82 percent, a sensitivity of 98.41 percent, a precision
of 99.07 percent, and an area under the receiver operating characteristic curve of
0.9994.
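
For reference, the metrics reported above are all derived from the counts in a confusion matrix. The sketch below shows the standard definitions. The counts passed in are hypothetical and are not the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: cancerous cases predicted correctly / missed.
    tn/fp: non-cancerous cases predicted correctly / flagged wrongly.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # also called recall
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),   # equals 1 - false positive rate
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity matters most in a diagnostic setting because a false negative corresponds to a missed cancer.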

The study illustrated that machine learning techniques such as the Support Vector
Machine (SVM), Artificial Neural Networks (ANN), and Naïve Bayes (NB) can
efficiently improve the detection of non-cancerous and cancerous tumours. It also served
as a base for a comparative study of the aforementioned methods and of how feature
selection and feature extraction can assist in the selection of an appropriate machine
learning algorithm when constructing an adaptive intelligent model.

The limitations of this study are that it makes use of a small dataset, does not have an
interface, and uses numerical data for training the model, which can also be prone to
misdiagnosis. Prospects of the study include developing the proposed approach into a
feasible practical tool for assisting doctors with a fast second opinion on breast cancer
diagnosis, comparing more machine learning algorithms used for breast cancer
diagnosis, analysing the obtained results, and applying the proposed approach to
more diseases (Omondiagbe et al., 2019).

1.11.3. Predicting Breast Cancer Risk Using Personal Health Data and Machine
Learning Models
The study aimed to create machine learning algorithms that can determine breast cancer
risk within five years with a higher accuracy than the standardized Breast Cancer
Risk Assessment Tool (BCRAT), and therefore foster early detection and prevention of
breast cancer. The Prostate, Lung, Colorectal and Ovarian (PLCO) data set was used in
this study. The data collection was created as part of a longitudinal, randomized,
controlled trial to assess the efficacy of various prostate, lung, colorectal, and ovarian
cancer screenings. Participants completed a survey form outlining their past and existing
health problems from November 1993 to July 2001. The dataset processing was done
entirely in Python.

The study was performed using logistic regression, Gaussian naive Bayes, decision tree,
linear discriminant analysis, support vector machine, and feed-forward artificial neural
network. The success of these six machine learning classifier models in estimating the
likelihood of individuals developing breast cancer in the next five years, based on the
PLCO survey form data, was tested. These machine learning models were chosen because
each has distinct benefits that could make it the best for fulfilling the aim of the study.
The implementation code was written in the Python programming language. The Python
scikit-learn package was used to build the logistic regression, naive Bayes, decision tree,
support vector machine, and linear discriminant analysis models. TensorFlow, a Python
library, was used to build the neural networks. The biases were initialized as constants
and all neural network weights were initialized with Xavier initialization. The neural
networks had logistic activation functions after each hidden layer and after the output
layer. The loss was defined using cross-entropy and minimized using an Adam optimizer.
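
The neural network setup described above can be illustrated without TensorFlow. The sketch below is a plain-Python approximation, not the study's code. It uses the uniform variant of Xavier (Glorot) initialization, a logistic activation, and binary cross-entropy. The layer sizes and input values are hypothetical, and the Adam optimization step is not shown.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform initialization: U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """Fully connected layer followed by the logistic activation.
    weights[i][j] connects input i to output unit j."""
    return [sigmoid(sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j])
            for j in range(len(biases))]

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy loss for one sample; eps guards against log(0)."""
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

rng = random.Random(0)                      # fixed seed for repeatability
w_hidden = xavier_uniform(4, 3, rng)        # 4 inputs, 3 hidden units (assumed sizes)
w_output = xavier_uniform(3, 1, rng)
b_hidden, b_output = [0.0] * 3, [0.0]       # constant biases, as in the text

features = [0.2, 0.5, 0.1, 0.9]             # hypothetical survey-derived inputs
hidden = dense(features, w_hidden, b_hidden)
risk = dense(hidden, w_output, b_output)[0]
loss = binary_cross_entropy(1, risk)
print(risk, loss)
```

Scaling the initial weights by the layer's fan-in and fan-out keeps the logistic units away from their flat, saturated regions at the start of training.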

The study's drawbacks include the absence of adequate biopsy or atypical hyperplasia
evidence in the PLCO dataset, as well as the fact that models were trained and tested on
different parts of one data set. The prospects of the study include evaluating how
effectively the machine learning models generalise to unseen data after training on the
total PLCO data set, determining if the selected machine learning models perform better
when trained on a larger data set after feature selection, and comparing the machine
learning models to the BCRAT when all models are trained with the seven BCRAT inputs
(Stark et al., 2019).

1.11.4. Breast Cancer Histopathological Image Classification Using Convolutional
Neural Networks
The research aimed to use convolutional neural networks to classify breast cancer
histopathological images. This research made use of the BreaKHis database.
The images in this database are microscopic histopathological images of benign and
malignant breast tumors. The images were taken as part of a clinical review from January
to December 2014. During this time, all patients referred to the P&D Lab in Brazil with a
clinical diagnosis of breast cancer were invited to participate in the research.

A variant of the AlexNet architecture was used in the process. The design
contained three convolutional layers with receptive fields (kernels) of size 5x5, zero-
padding set to 2 and the stride set to 1, and one pooling layer after each convolutional
layer, each set to use a 3x3 receptive field. The study's main contribution was the
increased accuracy of the CNN compared with standard machine learning models. To
increase accuracy further, future research might look at various CNN architectures,
hyper-parameter optimisation, and techniques for selecting representative patches
(Spanhol et al., 2016).
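
The effect of the kernel size, padding, and stride quoted above on the feature map size follows the usual formula, output = floor((n + 2p - k) / s) + 1. The sketch below applies it. The 64-pixel input side and the pooling stride of 2 are assumptions made for illustration, since the text above does not state them.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output side length of a convolution or pooling window:
    floor((n + 2 * padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

input_side = 64  # assumed patch side length, not taken from the study

# 5x5 kernel, zero-padding 2, stride 1: the spatial size is preserved.
after_conv = conv_output_size(input_side, kernel=5, padding=2, stride=1)

# 3x3 pooling window; stride 2 is an assumption (overlapping pooling).
after_pool = conv_output_size(after_conv, kernel=3, padding=0, stride=2)

print(after_conv, after_pool)
```

With a 5x5 kernel, a padding of 2 is exactly what keeps the output the same size as the input at stride 1, so only the pooling layers shrink the feature maps.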

1.11.5. Predicting Breast Cancer via Supervised Machine Learning Methods on
Class Imbalanced Data
The purpose of this research was to apply three different class balancing methods to the
Breast Cancer Surveillance Consortium (BCSC) dataset: oversampling, under-sampling,
and a hybrid method, before building the supervised learning models. The BCSC Data
Resource provided the data for this analysis. The data comes from various mammography
registries in the United States of America.

The data was pre-processed and transformed to remove missing values and other
anomalies and converted into a suitable format for further processing that
was comprehensible and consistent with the data mining methods used on the dataset.
SMOTE was used as the first class balancing method to oversample the minority class
label, resulting in the creation of a separate dataset. The SpreadSubsample method was
used to apply undersampling as the second method of class balancing on the original
training dataset, and a new training dataset was created. Following that, the imbalanced
dataset was resampled using a combination of oversampling and undersampling
techniques, and a training dataset was created using this method. Data mining techniques
such as the Bayesian Network, Random Forest, and Decision Tree (C4.5) were applied to
the dataset. The k-fold method (10 folds) was used to validate the model. The
class-imbalanced training data and the three class-balanced training sets were all
cross-validated.
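
The two balancing ideas above can be sketched in a few lines. This is a simplified illustration rather than the study's procedure: the oversampler interpolates towards the single nearest minority neighbour, whereas SMOTE proper picks among several nearest neighbours, and plain random undersampling stands in for the SpreadSubsample filter. All data points are hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, rng):
    """Create n_new synthetic samples, each lying on the line segment between
    a random minority sample and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = min(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class without replacement."""
    return rng.sample(majority, n_keep)

rng = random.Random(42)
minority = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]               # hypothetical cancer cases
majority = [[float(i), float(i % 3)] for i in range(12)]      # hypothetical non-cancer cases

# Hybrid approach: grow the minority class and shrink the majority class.
new_minority = minority + smote_like_oversample(minority, 3, rng)
new_majority = random_undersample(majority, 6, rng)
print(len(new_minority), len(new_majority))
```

Balancing should be applied to the training folds only, so that the validation fold still reflects the original class distribution.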

The findings revealed that the Bayesian Network generated using the hybrid approach
from class-balanced BCSC data had the best overall performance in terms of accuracy
(99.1%), ROC (0.937), sensitivity (78.1%), false positive rate (0%), and precision (100%).
The contributions of this study include introducing class balancing methods on the BCSC
dataset and introducing the Bayesian network, which was rarely explored in previous
studies, as a method to achieve high accuracy.
Further work for the study involves employing feature selection on the BCSC dataset and
separating variables with the same properties, developing a predictive model based on
feature selection and similar variables that can generate a generic model with fewer risk
factors to be used for prediction, and applying the technique to other data with
features including form, position, tumor size, or radiation intensity (Rajendran et al.,
2020).

1.11.6. Breast Cancer and Prostate Cancer Detection Using Classification
Algorithms
The aim of this research was to use machine learning techniques and cancer-specific
information to come up with hypotheses regarding gene patterns for early cancer detection
and prediction. The breast cancer dataset was obtained from Kaggle and contains 569
instances, 357 of which are non-cancerous and 212 of which are cancerous. It has 32
attributes: an ID number, a class attribute with two labels (B = benign, M = malignant),
and 30 real-valued attributes. These attributes were derived from digitised images of a
fine needle aspiration procedure performed on a breast tumor and describe the
properties of the cell nuclei in the image.

Classification models such as Support Vector Machine (SVM), Decision Tree (DT),
Logistic Regression (LR), Random Forest Classifier (RF), and Naïve Bayes (NB) were
used. Stratified K-folds was applied from 0 to 20 with a step size of 5. Dimensionality
reduction was performed by applying Principal Component Analysis (PCA).
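
Stratified k-fold splitting, mentioned above, keeps the benign-to-malignant ratio roughly the same in every fold. The sketch below shows one simple way to do this by dealing the indices of each class round-robin across the folds. It is an illustration with hypothetical labels, it does not shuffle, and it is not the study's implementation.

```python
def stratified_folds(labels, k):
    """Assign each sample index to one of k folds so that every fold
    keeps roughly the overall class ratio."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical labels with a 2:1 benign-to-malignant ratio.
labels = ["B"] * 10 + ["M"] * 5
folds = stratified_folds(labels, k=5)

for i, test_idx in enumerate(folds):
    # Each fold serves once as the test set; the rest form the training set.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    counts = {c: sum(labels[j] == c for j in test_idx) for c in ("B", "M")}
    print(i, counts, len(train_idx))
```

Without stratification, a fold drawn from a small dataset could contain very few malignant cases, which would make the per-fold accuracy unstable.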

After applying Stratified K-Fold, it was discovered that the Random Forest classification
model had the highest accuracy, at 96.05 percent. Dimensionality reduction was
effective, mostly for the Support Vector Machine, whose accuracy rose sharply from
62.57 percent to 95.5 percent. However, owing to the randomization of the train-test
split, it caused a very small decrease in the accuracy of Random Forest, to 95%.

The limitation of this study is that the application may fail to classify correctly because
the classification label covers a large number of groups. The prospects of the study
include using a larger dataset, dealing with outliers and incomplete values, tuning
algorithms, combining several models, and applying deep learning (Sreenivasa B C, 2020).

1.11.7. Development Of A Breast Cancer Risk Assessment Model Using A Machine
Learning Approach
The aim of this study was to use an advanced machine learning approach to create breast
cancer risk prediction models that outperformed the Gail model. This study used data
from a single breast cancer research facility. The data collection consisted of 393 patients
with a total of 129 variables after excluding unnecessary variables (e.g., respondent ID
and interview date). Demographics and pathologic results (i.e., biopsy findings) were
included in the study.

Three supervised machine learning algorithms were selected for this project: Logistic
regression, J48 Decision Tree, and Random forest. Logistic regression (LR) is an
algorithm that solves binary classification problems by fitting the data to a
logistic function to estimate the risk of an occurrence (in this case, developing invasive
cancer). The J48 Decision Tree (DT) is a rank-based model made up of decision principles
that repeatedly separates independent variables into similar sections depending on the
input variables' most important splitter. Random forest (RF) is an algorithm that
generates, constructs, and then joins a series of classification trees to predict the
probability of breast cancer. The Logistic regression and Random forest models
were selected because they do not overfit easily, while the Decision Tree model can
accommodate non-linear correlations between variables. The models were created using
Weka 3.8, an open source data mining platform. The calibration and discrimination
accuracy of each model were calculated to assess their performance.
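
Two of the ideas above lend themselves to a short illustration: how logistic regression turns a weighted sum of risk factors into a probability, and how discrimination is commonly summarised by the area under the ROC curve (AUC). The sketch below is written in Python rather than Weka, and the weights, bias, and scores are hypothetical values chosen only to demonstrate the calculations.

```python
import math

def logistic_risk(features, weights, bias):
    """Logistic regression: squash a weighted sum of inputs into (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def auc(scores, labels):
    """Discrimination as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical coefficients for two scaled risk factors.
weights, bias = [1.5, 0.8], -2.0
patients = [[2.0, 1.0], [0.2, 0.1], [1.0, 1.5], [0.1, 0.4]]
outcomes = [1, 0, 1, 0]  # 1 = developed invasive cancer (made-up labels)

risks = [logistic_risk(p, weights, bias) for p in patients]
print([round(r, 3) for r in risks], auc(risks, outcomes))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups. Calibration, which compares predicted and observed risk levels, is a separate check and is not shown here.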

There are two drawbacks to this research. First, the data for this analysis came from a
single research center. As a result, unless the six prediction models established in this
analysis are further validated with larger health data sets, they cannot be generalized.
Second, the prediction models were created using data from primarily white women
(91.1%), which again poses a generalizability problem. Risk for women of other races,
such as African Americans, may not be estimated correctly by the models. Prospects of
the study include further evaluation of the models on data sets with a varied
racial/ethnic mix (Choi et al., 2020).

1.12. REVIEW OF RELATED FINDINGS


From the analysis carried out, there is a shortage of available datasets for training
machine learning models, and most researchers made use of numerical data or expert
systems. Most of the research also did not implement its findings in a working
system and hence cannot be evaluated in practice. It is evident that a tool for diagnosing
breast cancer with a risk assessment option is needed, especially in low and middle income
countries like Nigeria where pathologists and medical experts are not readily available, in
order to give a second opinion in the diagnosis process. The tool will also make use of
convolutional neural networks in order to give a faster and more accurate diagnosis.

CHAPTER THREE

SYSTEM ANALYSIS AND DESIGN

1.13. INTRODUCTION
This chapter gives an explanation of the methodologies and designs of the proposed
implementation of the breast cancer diagnosis system in depth. The analysis begins with a
comprehensive written description of the system's requirements, which is then used to
develop precise graphical models of the breast cancer diagnosis software application.

1.14. SYSTEM ANALYSIS


The systems development life cycle can be used to explain the process of creating
systems. Systems analysis is a phase of the system development life cycle that
covers all the activities of the person(s) examining a system to assess, represent, and
select a logical replacement for an information system (Felsenstein, 2003).

1.14.1. Analysis Of Existing Systems


 Design And Implementation Of A Fuzzy Expert System For Diagnosing
Breast Cancer
This system was created by Okikiola et al. (2016). The major goal of the system
was to use fuzzy logic to create an expert system that diagnoses breast cancer.
The system had two limitations: it relied on an expert system, which is not as
accurate as examining breast cancer tissue samples first hand, and it did not
include a breast cancer risk assessment tool for women who do not have
cancerous cells.

 Algorithm For The Detection Of Breast Cancer In Digital Mammograms
Using Deep Learning
This system used convolutional neural networks to diagnose breast cancer from
mammography images. However, a small dataset was used, mammography
images are unreliable for detecting breast cancer in dense breasts, and the
system lacked a user interface (Pirouzbakht & Mejía, 2017).

1.14.2. The Proposed System


Based on the setbacks of the previously analysed systems, MELIGNANT, a web-based
breast cancer diagnosis application, is proposed. The application allows pathologists to
input patient biopsy histopathological images and uses convolutional neural networks to
diagnose breast cancer with an accuracy of 93.8%. After diagnosis, pathologists can also
use the built-in risk assessment tool to evaluate the susceptibility of a patient without
breast cancer to developing it, with an accuracy of 99%. The system also allows
pathologists to view the history of previously diagnosed patients and creates a database to
store patient data, which can help bridge the data storage gap in Nigeria. The
major goal of the system is to accurately diagnose breast cancer biopsy images so as to
provide a second opinion for pathologists and also substitute for a pathologist in cases
where there is a shortage of medical professionals.

1.15. REQUIREMENTS ANALYSIS


Understanding and meeting client demands from the beginning to the end of a software
development project has been identified as one of the most difficult issues among many
others. Misunderstanding client demands, incorrect elicitations, and assumptions have
major time, quality, and financial implications for projects. This is why
requirement management is a multidisciplinary activity that includes marketing, analysis,
engineering, product design, verification, and validation (Demirel & Das, 2018).

There are two types of requirements: functional and non-functional. In most cases, a
functional requirement describes a behaviour from the perspective of the end-user. It
defines a user behaviour or activity that is carried out by the system user and is not
constrained by any system limitations. Non-functional requirements define the outcome
of user actions and are limited by the system's platform, environment, design restrictions,
or dependencies (Demirel & Das, 2018).

1.15.1. User Requirements


This section gives an overview of the proposed system requirements from the user’s
perspective.
 The user should be able to gain authorization to the system based on his/her
credentials, i.e., medical ID and password.
 The user should be able to input patients’ breast cancer histopathological images,
diagnose breast cancer tumors and view the diagnosis result.
 The user should be able to input patients’ breast cancer risk factors, perform risk
assessment test and view the risk assessment result.
 The user should be able to view a history of all previous diagnoses and risk
assessment tests.

1.15.2. Functional Requirements


 The user shall be able to sign up to gain authorization to the diagnosis system.
 The user shall be able to log in to be authenticated by the diagnosis system.
 The user shall be able to input a breast histopathological image for breast cancer
diagnosis.
 The user shall be able to view the result of the breast cancer diagnosis.
 The user shall be able to input patients’ risk factors data for risk assessment test.
 The user shall be able to view the result of the risk assessment test.
 The user shall be able to view the history of past diagnoses and risk assessment
tests.
 The user shall be able to save patients’ data to a database.

1.15.3. Non-Functional Requirements


 The system should be user friendly.
 The system should have a short onboarding process without any delays.

 The system should only authenticate users with a valid username and password.
 The system diagnosis and risk assessment should have a short response time.
 The system should update the database as required with no null fields.

1.16. SYSTEM ARCHITECTURE


System architecture refers to the basic characteristics of a system as well as the patterns
of interactions, connections, limitations, and interconnections among the system
components and between the system and its surroundings. It is theoretical and based on
the system's
goal and life cycle concepts. It also focuses on the high-level architecture of systems and
system elements. It addresses the concepts and characteristics of the architectural system
of interest.

System architecture activities aim to produce a holistic result based on interrelated and
uniform concepts, notions, and characteristics. The solution architecture provides
aspects that, to the maximum extent possible, address the issue or opportunity outlined by
a set of system requirements and product lifecycle ideas and can be implemented using
technologies, one of them being machine learning (Barrier, 2003).

1.16.1. Convolutional Neural Networks (CNNs) Architecture


CNNs are a type of deep neural network that can detect and classify particular
features in images, and are commonly employed for image analysis. Their uses include
analysing medical images, recognising images and videos, computer vision, and natural
language processing. A CNN will be used to train the MELIGNANT diagnosis model. The
most important component of a CNN is the convolutional layer, which acts as a filter
between the input and output. It generates a feature map from the input that summarizes
the observed features. The convolutional layers break an image down into key features,
and the label of the image is then predicted from those features. CNN architecture is
divided into two sections:
 Convolution tool: This section is responsible for feature extraction which is a
method of extracting and detecting the different features of a picture for analysis.
 Fully connected layer: This uses the convolution process output to predict the
category of the image using the characteristics gathered previously.
These sections are shown below:

Figure 3.1: Convolutional Neural Network Architecture (Phung & Rhee, 2019)
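
As a rough illustration of the two sections of a CNN, the sketch below slides a small filter over a toy image, pools the resulting feature map, flattens it, and passes it to a single fully connected unit. It is a plain-Python sketch under stated assumptions, not the MELIGNANT model: the image, the edge-detecting filter, and the fully connected weights are all hypothetical and fixed by hand rather than learned.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take the
    dot product at each position. As in most CNN libraries, the kernel is
    not flipped, so this is technically a cross-correlation."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def fully_connected(vector, weights, bias):
    """One fully connected unit: weighted sum of the flattened features."""
    return sum(v * w for v, w in zip(vector, weights)) + bias

# Toy 5x5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked 2x2 filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)           # feature extraction
pooled = max_pool(feature_map, size=2)            # downsampling
flattened = [v for row in pooled for v in row]    # input to the FC section
score = fully_connected(flattened, [0.5, 0.5, 0.5, 0.5], bias=-1.0)
print(feature_map[0], pooled, score)
```

In a trained network the filter values and fully connected weights are learned from data, and many filters are applied in parallel to detect different features.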

A CNN is made of three types of layers: convolutional layers, pooling layers,
and fully-connected (FC) layers. Connecting these layers produces a CNN
architecture. Aside from these layers, two other important components are the dropout
layer and the activation function. These are explained below (Gurucharan, 2020):
 Convolutional Layer: This is the very first layer in a CNN architecture. It is
responsible for deriving distinct properties from the images passed in as input.
This is the layer where the convolution operation is performed: a filter of a
given size is slid across the input image (MxM), and the dot product between the
filter and each section of the input image is taken. This process produces a
feature map, which captures details of the picture such as its corners and edges.
This feature map is passed to additional layers so that they can learn further
distinct features of the input image.

 Pooling Layer: The pooling layer usually comes after the convolutional layer. It
aims to reduce computing cost by shrinking the convolved feature map. To fulfill
this aim, the pooling layer reduces the connections between layers and works
independently on each feature map. Pooling procedures are classified by the
operation performed: max pooling, where the largest element is selected from
each section of the feature map; average pooling, where the mean value of the
elements in a predefined section is calculated; and sum pooling, where the sum
of all the elements in the predefined section is calculated. The pooling layer
serves as a link between the convolutional and fully connected layers.

 Fully Connected (FC) Layer: This layer consists of weights, biases and
neurons and connects the neurons of two successive layers. FC layers
are usually placed before the output layer and form the last few layers of a CNN
architecture. The output of the preceding layers is flattened and passed to the FC layer. The
flattened vector then passes through a few more FC layers, where the mathematical
operations are carried out and the classification process
begins.

 Dropout: This layer is used to prevent overfitting, that is, a model that
performs well on its training data but fails to generalize to unseen data.
It accomplishes this by randomly dropping a fraction of the neurons from the neural
network during training, which results in a smaller model. For
example, with a dropout rate of 0.3, 30% of the neural network nodes are
dropped at random.

 Activation Functions: Activation functions are used to learn and approximate
continuous and complex relationships between the variables of the network. An activation
function determines which information should be passed forward towards the end of the
network and which should not, and it adds non-linearity to the model.
The ReLU, softmax, tanh, and sigmoid functions are the most common activation
functions, and each has a distinct use. The sigmoid and softmax functions are
usually recommended for a binary classification CNN model.
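
The layer operations described above can be illustrated with a minimal, framework-free sketch. It is not the MELIGNANT model: the 5x5 input, the 2x2 filter, the class scores and all function names are assumptions chosen only to show how convolution, ReLU, max pooling, dropout and softmax transform their inputs.

```python
import math
import random

def convolve2d(image, kernel):
    """Slide a k x k filter over the image (stride 1, no padding),
    taking the dot product at every position to build the feature map."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(size)]
            for r in range(size)]

def relu(feature_map):
    """Activation: negative responses are set to zero."""
    return [[max(0.0, float(v)) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the largest value of each non-overlapping size x size region."""
    n = len(feature_map) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(n)]
            for r in range(n)]

def dropout(values, rate, rng):
    """Training-time dropout: zero each value with probability `rate`
    and rescale the surviving values by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5x5 image containing a vertical edge, and a 2x2 edge filter.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]

feature_map = relu(convolve2d(image, kernel))     # 4x4 feature map
pooled = max_pool(feature_map)                    # 2x2 map after pooling
flattened = [v for row in pooled for v in row]    # vector passed to the FC layers
dropped = dropout(flattened, rate=0.3, rng=random.Random(0))
probabilities = softmax([1.2, -0.4])              # made-up scores for two classes
```

In this toy case the filter responds only where the pixel values change from 0 to 1, so the feature map is non-zero along the edge and pooling keeps that response while quartering the number of values.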

Several pretrained CNN models can be used to solve image classification
problems instead of building a model from scratch. For the purpose of training the
MELIGNANT breast cancer diagnosis model, the pretrained model MobileNet is used.
MobileNet is a convolutional neural network for mobile vision applications that is simple,
efficient, and computationally light. Object identification, fine-grained classification,
facial attribute recognition, and localisation are just a few of the real-world applications that
employ MobileNet. The features of MobileNet include:
 Depth-wise separable convolution: This is made up of the depth-wise layer and
the point-wise layer. In essence, the first layer filters the input channels, while the
second layer combines them to generate new features.

 The entire network structure: The architecture of MobileNet is shown below.

Figure 3.2: MobileNet Architecture (Niu et al., 2019)

 MobileNet Parameters: Although the fundamental MobileNet design is already
small and computationally light, it provides two global hyperparameters that
further reduce the computing cost: the width multiplier and
the resolution multiplier.
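
The saving from depth-wise separable convolution and the effect of the two multipliers can be made concrete with a short calculation. The sketch below uses the multiply-add cost expressions given in the MobileNet paper; the layer dimensions (3x3 filter, 512 input and output channels, 14x14 feature map) and the function names are assumptions chosen for illustration.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution: Dk * Dk * M * N * Df * Df,
    for filter size Dk, M input channels, N output channels, feature map Df x Df."""
    return dk * dk * m * n * df * df

def separable_conv_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Cost of a depth-wise separable convolution.
    alpha (width multiplier) thins the channels; rho (resolution multiplier)
    shrinks the feature map."""
    m, n, df = alpha * m, alpha * n, rho * df
    depthwise = dk * dk * m * df * df   # one filter per input channel
    pointwise = m * n * df * df         # 1x1 convolution that combines channels
    return depthwise + pointwise

standard = standard_conv_cost(3, 512, 512, 14)
separable = separable_conv_cost(3, 512, 512, 14)
thin = separable_conv_cost(3, 512, 512, 14, alpha=0.5)

ratio = separable / standard            # equals 1/N + 1/Dk^2
```

For a 3x3 filter the ratio is a little over one ninth, which is why the separable form needs roughly eight to nine times fewer multiply-adds than a standard convolution of the same shape.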

1.16.2. Supervised Learning Architecture


Supervised learning builds a mathematical model from training data that contains both the inputs and
the intended outputs, where every input is paired with a supervisory signal (output). The
system uses the provided training data to establish the association between the input
and the output, and after training it is able to determine the output for new inputs
via the established relationship. Based on the type of output, supervised learning is
categorized into classification and regression analysis. When the outputs are constrained to
a limited set of values, classification analysis is used. The process of performing classification
analysis involves (Pedamkar, 2021):
 Data Acquisition: Also referred to as data preprocessing, data acquisition is the
first step defined in the architecture because machine learning depends on
available data for decision making. Data collection, the preparation and segregation
of case scenarios according to the specific properties required for making
decisions, and the transmission of data to the processing unit for further classification are
all part of this process.

 Data Processing: The data from the data acquisition layer is
subsequently transferred to the data processing layer for further refinement and
integration. This involves data normalization, cleaning, transformation, and
encoding. Data processing depends on the type of learning that is employed.
In supervised learning, the input data is separated into several sets of
sample data needed to train the system, and the resulting data is called
training data.
 Data Modelling: This layer entails selecting the algorithms
that can train the system to solve the machine learning problem. These
algorithms are either developed from scratch or taken from a collection of libraries.
The algorithm used to train the MELIGNANT breast cancer risk
assessment model is a Logistic Regression classifier. Logistic regression is a
predictive regression analysis used for binary classification. It describes the data
and the association between one output variable and one or more input variables.
The types of logistic regression include: binary logistic regression, which is used
when the output variable has two categories; multinomial logistic regression,
which is used when the output variable has more than two categories; and ordinal
logistic regression, which is used when the categories of the output variable are
ranked.

 Execution: Experimentation, testing, and tweaking are all done at this phase in
machine learning. The overall objective of this stage is to improve the algorithm so
as to derive the required computation result and improve system efficiency. The
output of this step is a fine-tuned result capable of supplying the data required for the
decision-making process of the machine.

 Deployment: ML outputs must be operationalized or passed on for further
evaluation. The result may be seen as a non-deterministic query that is passed
to the decision-making procedure. It is important for the ML output to be
moved seamlessly into production, enabling the machine to make decisions
immediately based on the result and minimising the need for an extra
experimental process.
The processes are described in the diagram below:

Figure 3.3: Supervised Learning Architecture (Pedamkar, 2021)
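
As an illustration of the data modelling step, binary logistic regression can be sketched in a few lines without any library. This is only a sketch under stated assumptions: the single centred feature, the toy labels, the learning rate and the function names are hypothetical, and the risk assessment model in this project was trained with scikit-learn rather than with this code.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def predict(w, b, x, threshold=0.5):
    """Return class 1 when the predicted probability reaches the threshold."""
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold)

# Hypothetical toy data: one centred feature, label 1 when the feature is positive.
xs = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
ys = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
predictions = [predict(w, b, x) for x in xs]
```

The output of the sigmoid is a probability, which is why logistic regression can report a risk estimate and not only a class label.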

1.16.3. Web Framework Architecture


Web development frameworks are methods and tools used by software developers to
develop and support online applications, web services, and websites. They are made up of
three parts: templating capabilities for displaying information in a browser, a programming
environment for scripting the flow of information, and application programming interfaces
(APIs) for accessing the underlying data resources. They also offer software developers
the foundations and system-level services they need to create a Web content
management system (CMS) for managing digital material. They are used by software
developers to build 'out-of-the-box' content management, user authentication, and
administrative tools. The web framework used to implement the breast cancer
diagnosis system is Django.

Django, written in the Python programming language, is a popular and widely used
open-source web framework. Django follows the Model-View-Controller (MVC)
architectural pattern (Model-Template-View in Django's own terminology), whose parts can be described as:

 The Model, which is backed by a database (e.g. MySQL or PostgreSQL), is the logical data
structure that supports the program.
 The View, or user interface, is the part users see when they visit a website in the
browser. It is represented by HTML/CSS/JavaScript files.
 The Controller acts as the connection between the view and the model and
is responsible for delivering data from the latter to the former.
Hence, under MVC the software centres on the model, which is either displayed or
modified.
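
The division of responsibilities in MVC can be sketched in plain Python. The sketch deliberately avoids Django itself; the class names, the patient ID and the in-memory dictionary standing in for the database are all assumptions made for illustration.

```python
class PatientModel:
    """Model: owns the data (a dict stands in for the database)."""

    def __init__(self):
        self._records = {}

    def save(self, patient_id, result):
        self._records[patient_id] = result

    def get(self, patient_id):
        return self._records.get(patient_id)


def render_result(patient_id, result):
    """View: turns data into what the user sees in the browser."""
    if result is None:
        return f"<p>No diagnosis found for {patient_id}</p>"
    return f"<p>Patient {patient_id}: {result}</p>"


class DiagnosisController:
    """Controller: passes data between the model and the view."""

    def __init__(self, model):
        self.model = model

    def record(self, patient_id, result):
        self.model.save(patient_id, result)

    def show(self, patient_id):
        return render_result(patient_id, self.model.get(patient_id))


controller = DiagnosisController(PatientModel())
controller.record("P001", "benign")
page = controller.show("P001")
missing = controller.show("P999")
```

Because the view only receives data and the model only stores it, either can be changed (for example, swapping the dictionary for a real database) without touching the other.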

1.17. SYSTEM DESIGN


Systems design is the process of defining the features of a system, i.e. its modules,
architecture, components, interfaces, and data, according to specified user
requirements. It involves defining, building, and describing systems
that meet the aims and expectations of an organization. It requires a structured technique for
maximum system coordination and performance, and a bottom-up or top-down strategy
that considers all the relevant system factors. Designers use modeling languages to describe
the details and organization of the system according to a fixed set of rules and
definitions. Design languages can be either graphical or textual. Examples include the
Unified Modelling Language (UML), flowcharts, Business Process Modelling Notation
and the Systems Modelling Language (SysML).
UML is a graphical language used to visualize, specify, construct, and document the artifacts of
software-intensive systems. It provides a standard approach for writing system plans,
covering both conceptual and concrete elements such as business and system operations,
as well as programming language statements, database schemas, and reusable software
components (Gogolla, 2009). UML diagrams are divided into two categories:
 Structural Diagrams: This describes the system's static features or structure.
 Behavior Diagrams: This describes the system's dynamic characteristics or
behavior.
This section gives a detailed explanation of the graphical models that describe the
structure and behavior of the breast cancer diagnosis system. These diagrams are the class
diagram, sequence diagram, use case diagram, and activity diagram.
1.17.1. Class Diagram
This is a static structural diagram used in software engineering to depict the organization of a
system by displaying its classes, attributes, methods, and the associations among its
objects. It is a fundamental component of object-oriented modeling and is used both for
the conceptual representation of the application's structure and for the detailed
design that translates the models into program code. The classes in a class
diagram represent the essential, programmed parts of the application (Pilone & Pitman,
2005). The class diagram of the MELIGNANT Breast Cancer Diagnosis System is shown
below.

Figure 3.4: Class Diagram of the Breast Cancer Diagnosis System

1.17.2. Sequence Diagram
Also known as an event diagram or event scenario, this is a behavioral diagram that describes the
order and the manner in which objects in an application interact. This diagram gives
software and business professionals an understanding of the specified requirements of a
new system or an existing process. The sequence diagram of the MELIGNANT Breast
Cancer Diagnosis System is shown below:

Figure 3.5: Sequence Diagram of the Breast Cancer Diagnosis System
1.17.3. Use Case Diagram
This is the fundamental description of the system or software requirements for a software
application that is yet to be developed. It defines only the expected system
behavior, not the methods used to achieve it, and can be presented in either a textual or a graphical
manner. It helps to explain a system from the users’ viewpoint and describes system
behavior to users by detailing every visible system process. The use case diagram of the
MELIGNANT Breast Cancer Diagnosis System is shown below.

Figure 3.6: Use Case Diagram of the Breast Cancer Diagnosis System

1.17.4. Activity Diagram


This is a behavioral diagram, i.e. it describes the behavior of the system. It shows the flow of
control of a system from start to finish and the alternative decision paths that
exist when completing a process. It also shows the sequential and concurrent processing
of activities and is used in business and process modeling to represent the dynamic
properties of a system. The activity diagram of the MELIGNANT Breast Cancer Diagnosis
System is shown below.

Figure 3.7: Activity Diagram of the Breast Cancer Diagnosis System

1.18. DESCRIPTION OF TABLES
The breast cancer diagnosis system database design consists of some tables that are
described in the section below
1.18.1. User Creation Table
This is the table that stores the information of each pathologist that uses the application.
Table 3.1: User Creation Table

Field        Description                                 Data Type   Null   Key
first_name   The pathologist’s first name                CharField   No
last_name    The pathologist’s last name                 CharField   No
email        The pathologist’s email address             CharField   No
username     The unique medical identifier of the user   CharField   No
password     The password of the user                    CharField   No

1.18.2. Patient Diagnosis Data Table


This is the table that stores the information of each patient that has been diagnosed via the
application.
Table 3.2: Table of the Patient Diagnosis Data

Field              Description                                     Data Type   Null   Key
patient_name       The full name of the patient.                   CharField   No
patient_id         The unique ID of the patient.                   CharField   No     PRIMARY
hist_image         The diagnosed histopathological image.          CharField   No
diagnosis_result   The result of the diagnosis.                    CharField   No
pathologist        The pathologist that performed the diagnosis.   CharField   No

1.18.3. Patient Risk Assessment Data Table
This is the table that stores the information of each patient that has taken a
risk assessment test via the application.
Table 3.3: Table of the Patient Risk Assessment Data

Field         Description                                                                  Data Type   Null   Key
patient_id    The unique ID of the patient.                                                CharField   No     FOREIGN
menopause     The menopausal status of the patient.                                        CharField   No
age           The age group of the patient                                                 CharField   No
density       The BI-RADS breast density of the patient                                    CharField   No
bmi           The body mass index of the patient                                           CharField   No
agefirst      The patient’s age at first birth                                             CharField   No
nrelbc        The number of first degree relatives of the patient that have breast cancer  CharField   No
brstproc      Previous breast procedure of the patient                                     CharField   No
lastmamm      The result of the patient’s last mammography                                 CharField   No
surgmeno      Surgical or natural menopause                                                CharField   No
hrt           Use of hormone replacement therapy                                           CharField   No
invasive      History of invasive breast cancer in the patient                             CharField   No
risk_result   The result of the risk assessment test                                       CharField   No
pathologist   The pathologist that performed the test                                      CharField   No
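
The relationship between the diagnosis table and the risk assessment table (a primary key referenced by a foreign key) can be sketched with Python's built-in sqlite3 module. This is an illustrative stand-in only: the system itself uses Django models on MySQL, the table names and sample values below are hypothetical, and only a subset of the risk factor columns is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE patient_diagnosis (
    patient_id       TEXT PRIMARY KEY,
    patient_name     TEXT NOT NULL,
    hist_image       TEXT NOT NULL,
    diagnosis_result TEXT NOT NULL,
    pathologist      TEXT NOT NULL
);
CREATE TABLE patient_risk_assessment (
    patient_id  TEXT NOT NULL REFERENCES patient_diagnosis(patient_id),
    menopause   TEXT NOT NULL,
    age         TEXT NOT NULL,
    risk_result TEXT NOT NULL,
    pathologist TEXT NOT NULL
);
""")

def add_risk_assessment(patient_id, menopause, age, risk_result, pathologist):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO patient_risk_assessment VALUES (?, ?, ?, ?, ?)",
                (patient_id, menopause, age, risk_result, pathologist),
            )
        return True
    except sqlite3.IntegrityError:
        return False

with conn:
    conn.execute(
        "INSERT INTO patient_diagnosis VALUES (?, ?, ?, ?, ?)",
        ("P001", "Jane Doe", "images/P001.png", "benign", "dr_a"),
    )

stored = add_risk_assessment("P001", "pre", "50-54", "low", "dr_a")
rejected = add_risk_assessment("P999", "pre", "50-54", "low", "dr_a")
row_count = conn.execute("SELECT COUNT(*) FROM patient_risk_assessment").fetchone()[0]
```

The second insert is refused because no diagnosis record exists for patient P999, which is the guarantee the FOREIGN key in Table 3.3 is meant to provide.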

CHAPTER FOUR

SYSTEM IMPLEMENTATION

1.19. INTRODUCTION
This chapter discusses the hardware and software requirements, user interface modules,
data analysis and processing outcomes, programming languages, frameworks, and
libraries used to accomplish the project's goals. The deployment of the breast cancer
diagnosis system is also described and demonstrated. The system was created with the intent
of maximising user experience, understanding, ease of use, and predictive accuracy.

1.20. SYSTEM REQUIREMENTS


System requirements, also known as minimum system requirements, are the
configurations a system must have so that a hardware or software program can function
effectively. If the specifications are not met, there may be installation issues
(the software cannot be installed) or system inefficiency (the software crashes).

1.20.1. Hardware Requirements


These are the physical computer resources that an operating
system or software application needs in order to run. The minimum hardware requirements for
running the MELIGNANT Breast Cancer Diagnosis System include:
Table 4.1: Hardware Requirements

Requirement         Hardware
Processor           Intel Pentium II 2.5GHz or higher
Primary Memory      4GB or higher
Architecture        64Bit (X64)
Microscope          Digital or Stereo Microscope
USB                 USB Microscope
Secondary Storage   32GB HDD or higher
1.20.2. Software Requirements
Software requirements define the software resource needs and necessities that must be
preinstalled in order for a program to work properly. The minimum software requirements
for running the MELIGNANT Breast Cancer Diagnosis System include:
Table 4.2: Software Requirements

Requirement                   Software
Operating System              I. MAC: OS X v10.7 or higher
                              II. Windows: 7 or newer
                              III. Linux: Ubuntu
                              IV. iOS: v5 or higher
                              V. Android: v2.1 or higher
Programming Language          Python, HTML, CSS, Bootstrap
Development Tool              Visual Studio Code, Jupyter Notebook
Database Management System    MySQL
Web Framework                 Django
Supported Browsers            I. Chrome 4.0 and above
                              II. Safari 5.0 and above

1.21. IMPLEMENTATION PROCESS


This is an overview of the entire process involved in the development of the
MELIGNANT breast cancer diagnosis software.

1.21.1. Implementation Tools Used


 Python: This is a general-purpose programming language, i.e. it can be used for
various types of software development. It can be used for developing the
server side of web and mobile applications, big
data processing, solving mathematical problems and system scripting. Python has
some useful libraries for machine learning. These include:

 Numpy: For scientific computing, NumPy is the most significant Python
package. It provides a multidimensional array object, derivative objects,
and a number of array-related functions.
 TensorFlow: This is an open-source machine learning platform. It allows
researchers and developers to build and deploy machine learning
applications with a vast and versatile range of packages, libraries and
community resources.
 Matplotlib: This is a charting library which embeds charts in applications
using an object-oriented API.
 Scikit-Learn: This is a popular machine learning package. Scikit-learn
consists of machine learning tools that include mathematical, analytical,
and general-purpose algorithms that serve as the foundation for a variety
of machine learning technologies.
 Pandas: This is a fast, powerful, versatile, and easy-to-use
open-source tool for analyzing and manipulating data. It has functions
and data structures for handling numerical tables and time series.
 Seaborn: This is a module for producing statistical visualizations, built
on matplotlib. It works with pandas data structures and helps people
explore and understand data. The seaborn plotting functions
perform the required semantic mapping and statistical aggregation to produce
useful charts, and they operate on dataframes and arrays that contain whole
datasets.

 JupyterLab: This is an interactive development environment for Jupyter notebooks,
code and data. Its interface is flexible: users can arrange and configure it to
support data science, scientific computing, and machine learning
workflows. It is modular and extensible, and allows
users to develop plugins that add new features and integrate with
existing ones.

 Visual Studio Code: Also known as VS Code, this is a free source-code editor.
VS Code is one of the most popular code editors because it has
many prominent features despite being lightweight. It works with a variety of
programming languages and allows users to add and develop new extensions, e.g.
code linters, debuggers, and support for cloud and web development.

 HTML and CSS: These are the most widely used web development
technologies. HTML is used for page organization while CSS improves the
visuals of the page layout. HTML, CSS, images and programming, are the
fundamentals for creating web applications.

 Bootstrap: This is a free, open-source CSS framework for responsive, mobile-first
front-end web development. It provides design templates for user interface elements
that are based on CSS and JavaScript.

 Django: Django is a free, open-source web application framework built on the
Python programming language. It makes use of the model-template-view
(MTV) architectural pattern. Its major aim is to make it easier to build complex,
database-driven websites. It emphasizes component reusability,
minimal code, loose coupling, and rapid development.

1.21.2. Diagnosis Model Training Process


The Breast Cancer Histopathological Image Classification (BreaKHis) dataset was used to
train the breast cancer diagnosis model. It consists of 7,909 microscopic images of breast
tumor tissue obtained from eighty-two (82) patients at four
magnification factors (40X, 100X, 200X, and 400X). There are two thousand, four hundred and
eighty (2,480) benign and five thousand, four hundred and twenty-nine (5,429) malignant
samples in the dataset. The dataset was created with the help of the P&D
Laboratory - Pathological Anatomy and Cytopathology, Parana, Brazil. There are two types of tumors in
the BreaKHis dataset: benign and malignant. A benign tumor is one that does not
meet any of the criteria for malignancy, such as significant cellular atypia, mitosis,
basement membrane rupture, metastasis, and so on. Benign tumors are
generally harmless, developing slowly and remaining confined. The term "malignant
tumor" refers to a tumor that can infiltrate and damage nearby structures (locally
invasive) as well as spread to distant places (metastasise) and cause mortality. The surgical
open biopsy (SOB) technique, also known as partial mastectomy or excisional biopsy, was used to obtain
the samples in the current version of the dataset. Compared with the various types of needle
biopsy, this procedure removes a larger tissue sample and is done under anaesthesia in a
hospital. Breast tumors, both benign and malignant, are classified into distinct types
depending on how their cells appear under a microscope. The different
types and subtypes each have their own prognosis and treatment options. There are
currently four histologically distinct types of benign breast tumors in the dataset and four
malignant types. The biopsy technique, tumor class, tumor type, patient identity, and
magnification factor are all stored in each image filename. For the purpose of this study,
the dataset was split into training, testing and validation sets.
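
Since each filename carries the label information, the class of an image can be recovered by parsing its name. The sketch below assumes the naming pattern procedure_class_type-year-slide-magnification-sequence (for example SOB_B_TA-14-4659-40-001.png); the pattern, the function name and the returned keys are assumptions and should be checked against the dataset's own documentation before use.

```python
from pathlib import Path

def parse_breakhis_name(filename):
    """Split a BreaKHis-style filename into its labelled parts.
    Assumed pattern: <procedure>_<class>_<type>-<year>-<slide>-<magnification>-<sequence>."""
    stem = Path(filename).stem
    procedure, tumor_class, rest = stem.split("_", 2)
    tumor_type, year, slide, magnification, sequence = rest.split("-")
    return {
        "biopsy_procedure": procedure,
        "tumor_class": "benign" if tumor_class == "B" else "malignant",
        "tumor_type": tumor_type,
        "patient_slide": f"{year}-{slide}",
        "magnification": int(magnification),
        "sequence": int(sequence),
    }

info = parse_breakhis_name("SOB_B_TA-14-4659-40-001.png")
```

Grouping images by the patient/slide identifier before splitting is one way to keep images of the same patient from appearing in both the training and the test sets.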

The first step in building the diagnosis model was importing the necessary libraries. These
include the TensorFlow library, with tools such as Keras for training convolutional neural
network models, and matplotlib for visualizing the model’s metrics.

Figure 4.1: Importing Necessary Libraries

The next step was to load the dataset and preprocess the images. This was done using
the Keras ImageDataGenerator class for image data augmentation. Image data augmentation is
the process of artificially enlarging the training data by generating
altered versions of the images in the dataset. The ImageDataGenerator class in Keras
supports a variety of augmentation strategies and is a fast and simple way to augment
images. The images were preprocessed using the preprocessing function of the pretrained
MobileNet model, resized, and loaded in batches of ten (10).

Figure 4.2: Loading and preprocessing the image data
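
What this step does to the data can be shown with a small library-free sketch. MobileNet preprocessing in Keras rescales 8-bit pixel values from [0, 255] to [-1, 1]; the horizontal flip is one example of an augmentation, and is an assumption here since the augmentations actually applied are those in Figure 4.2. The function names and the tiny 2x3 "image" are hypothetical.

```python
def mobilenet_scale(pixel):
    """Map an 8-bit pixel value from [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0

def horizontal_flip(image):
    """Augmentation example: mirror each row of the image."""
    return [list(reversed(row)) for row in image]

def batches(items, batch_size=10):
    """Yield the items in consecutive groups of at most batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical 2x3 grey-scale image.
image = [[0, 128, 255],
         [255, 128, 0]]

scaled = [[mobilenet_scale(p) for p in row] for row in image]
flipped = horizontal_flip(image)
batch_sizes = [len(b) for b in batches(list(range(25)))]
```

With a batch size of ten, 25 images are delivered as two full batches and one final batch of five.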

To visualize the preprocessed images, matplotlib was used to plot a single batch of the
images.

Figure 4.3: Visualising preprocessed images

For model architecture definition, the CNN pretrained model, MobileNet was imported
and modified to suit the purpose of classifying breast cancer histopathological images
accurately, and the summary() method was called to view the model architecture.

Figure 4.4: Building the model and viewing the summary

To configure the learning process, the compile() method was called with an Adam
optimizer, a categorical cross-entropy loss function, and accuracy as the model
evaluation metric. The ReduceLROnPlateau callback was used to reduce the learning
rate whenever the monitored validation metric stopped improving, and the weights with the highest
validation accuracy were saved to an HDF5 file for future evaluation. The model was trained
for ten (10) epochs and gave a training accuracy of 98.7% and a validation accuracy of
94.2%.

Figure 4.5: Compiling the model

Figure 4.6: Training the diagnosis model
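
The two ideas behind this configuration, the categorical cross-entropy loss and reducing the learning rate on a plateau, can be sketched without Keras. The class below only imitates the behaviour of the ReduceLROnPlateau callback; the factor, patience, starting learning rate and the sequence of validation losses are assumptions, not the values used in this project.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: minus the log of the probability given to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

class PlateauScheduler:
    """Multiply the learning rate by `factor` when the monitored loss has not
    improved for `patience` consecutive epochs."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

loss = categorical_crossentropy([0, 1], [0.5, 0.5])

# Made-up validation losses: the loss stalls at 0.7 for two epochs.
scheduler = PlateauScheduler(lr=0.001)
history = [scheduler.step(v) for v in [0.9, 0.7, 0.7, 0.7, 0.6]]
```

The learning rate is halved only after the loss has failed to improve twice in a row, so a single noisy epoch does not trigger a reduction.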
Learning curves of the training accuracy against the validation accuracy, and of the training
loss against the validation loss, were plotted.

Figure 4.7: Accuracy and loss graphs against training and validation data
The model was evaluated again with the test dataset and a confusion matrix was plotted.
On the test set, the model had an accuracy of 93.8%.

Figure 4.8: Confusion matrix of the Breast Cancer Diagnosis model on the test data
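
A confusion matrix simply counts how often each true class was predicted as each class, and accuracy is the share of counts on its diagonal. The sketch below shows the calculation on eight made-up labels; these are not the model's test results, which are those reported in Figure 4.8.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels for illustration only.
y_true = ["benign"] * 3 + ["malignant"] * 5
y_pred = ["benign", "benign", "malignant",
          "malignant", "malignant", "malignant", "malignant", "benign"]

matrix = confusion_matrix(y_true, y_pred, labels=["benign", "malignant"])
score = accuracy(matrix)
```

The off-diagonal cells separate the two kinds of error: a benign sample called malignant and a malignant sample called benign, which carry very different clinical consequences.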

Next, the model was saved for use in the web application using the Keras model.save()
method. With the save() method, Keras saves the model and all trackable objects linked to it,
such as the layers and variables. The optimizer, weights, and model settings are also
preserved. Additionally, the settings and information for each Keras layer linked to the
model are preserved.

Figure 4.9: Saving the diagnosis model

Figure 4.10: Embedding the diagnosis model

1.21.3. Risk Assessment Model Training Process


The dataset used to train the breast cancer risk assessment model was the Breast Cancer
Surveillance Consortium (BCSC) Risk Estimation dataset. The data was collected from
various breast cancer mammography registries in the United States of America. The
dataset contains the screening mammograms of 280,660 women, of whom 9,305
had breast cancer and 271,355 did not. Variables in the dataset include: menopausal
status, agegrp, density, bmi, agefirst, nrelbc, brstproc, lastmamm, surgmeno, hrt, invasive,
cancer, training and count. The information on the factors of interest was obtained
through questionnaires given to women during their mammography appointments and
from a radiologist who evaluated the mammogram findings at the screening facility. In
addition, cancer and pathology registry data were combined with the mammography
data to add to the breast cancer-related factors. To protect the privacy of patients,
mammography registries, and radiology institutions, the data was anonymised.
The first step in building the risk assessment model was importing the necessary libraries.
These include pandas, numpy, seaborn and matplotlib for data analysis and exploration
and scikit-learn for model training. The next step was to import the dataset and check for
null values in order to give insight for data preprocessing.

Figure 4.11: Checking for null values

The Pandas describe() method was called in order to know the fundamental statistics of
the dataset. The derived statistics include count, mean, standard deviation, minimum
value, maximum value, 25th percentile, 50th percentile and 75th percentile.

Figure 4.12: Viewing the basic statistics of the data
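
For readers without pandas at hand, the two checks above can be reproduced with the standard library. The sketch mirrors what a per-column null count and describe() report; the three toy rows and the function names are assumptions, and the standard deviation is the sample standard deviation, as in pandas.

```python
import statistics

def count_missing(rows, columns):
    """Number of missing (None) values in each column."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

def describe(values):
    """Basic statistics of a numeric column."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Hypothetical rows, not taken from the BCSC dataset.
rows = [
    {"agegrp": 3, "bmi": 2},
    {"agegrp": 5, "bmi": None},
    {"agegrp": 7, "bmi": 1},
]
missing = count_missing(rows, ["agegrp", "bmi"])
summary = describe([1, 2, 3, 4, 5])
```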


The next step was to convert the columns to the categorical datatype in order to group the
data, and to int32 in order to reduce the memory taken by the data.

The next step was to visualize the data using matplotlib and seaborn. This was done to
examine the relationships between the features of the dataset and draw conclusions. First, the
distribution of the columns of the dataset was plotted as a histogram, which is shown
below.

Figure 4.13: Distribution of the dataset columns

Next, the distribution of the target feature (cancer) in the dataset was plotted as a
histogram. This visualization shows that the dataset is highly unbalanced, with the
vast majority of the patients not having breast cancer.

Figure 4.14: Importance of the target data to the dataset


Next, the distribution of each age group in the dataset was plotted as a histogram.

Figure 4.15: Distribution of each age group in the dataset
Next, the training and count features were dropped because they are not risk factors and are
therefore irrelevant to the study, and a correlation matrix was plotted to discover which
features were highly correlated with the target feature. After plotting, it was discovered
that a history of breast cancer had a high correlation with breast cancer diagnosis.

Figure 4.16: Correlation matrix of the dataset
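
Each cell of a correlation matrix is a Pearson correlation coefficient between two columns. A minimal sketch of the calculation follows; the column values are toy numbers chosen for illustration, not BCSC data, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlation of every column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy columns.
data = {
    "invasive": [0, 0, 1, 1, 0, 1],
    "hrt":      [1, 0, 1, 0, 1, 0],
    "cancer":   [0, 0, 1, 1, 0, 1],
}
matrix = correlation_matrix(data)
```

Values close to 1 or -1 in the row of the target feature point to the columns most strongly associated with it, while values near 0 indicate little linear association.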

Next the data was split into training and testing sets using 75% for training the model and
25% for testing the trained model on unfamiliar data. The dataset was stratified to
maintain proportionality and improve accuracy.

Figure 4.17: Splitting the dataset into train and test
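
Stratification means the split is made separately within each class, so that the training and test sets keep the class proportions of the full dataset. The sketch below shows the idea on 100 made-up labels; the function name, the seed and the label counts are assumptions, and the project itself used scikit-learn for this step.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_indices, test_indices) with the class ratio preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical unbalanced labels: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
positives_in_test = sum(labels[i] for i in test_idx)
```

Without stratification, a random 25% sample of a highly unbalanced dataset could contain very few positive cases, which would make the test metrics unreliable.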


Due to the unbalanced state of the dataset, class balancing was carried out by
oversampling the minority class by 78%. This was done by using Synthetic Minority
Oversampling Technique (SMOTE). This works by selecting cases in the feature space
that are near each other, drawing a line in the feature space between the selected cases,
and drawing a new sample at a location along that line.

Figure 4.18: Balancing the classes of the dataset
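
The interpolation at the heart of SMOTE can be sketched as follows. This is a simplified illustration rather than the full technique: it always interpolates towards the single nearest neighbour, whereas SMOTE chooses among the k nearest neighbours, and the minority points, seed and function names are assumptions.

```python
import random

def nearest_neighbour(point, others):
    """Closest other minority point by squared Euclidean distance."""
    return min(
        (o for o in others if o is not point),
        key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)),
    )

def interpolate(point, neighbour, rng):
    """New sample at a random position on the line between the two points."""
    gap = rng.random()
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

def oversample(minority, n_new, seed=0):
    """Create n_new synthetic minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        neighbour = nearest_neighbour(point, minority)
        synthetic.append(interpolate(point, neighbour, rng))
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]
synthetic = oversample(minority, n_new=4)
```

Because every synthetic sample lies between two real minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing rows. Oversampling is normally applied to the training set only, so that the test set still reflects the real class distribution.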


A Logistic Regression classifier was trained on the data. After training, the model was
tested with the test dataset to ensure that it fit properly.

Figure 4.19: Classification metrics of the risk assessment model


Next, the model was saved using the pickle library in Python. Pickle is the standard way
of serializing Python objects. It can be used to serialize trained machine
learning models and save the serialized format to a file.

Figure 4.20: Pickling the Risk Assessment Model

Figure 4.21: Unpickling the Risk Assessment model
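
The round trip of pickling and unpickling can be sketched as below. A plain dictionary of made-up coefficients stands in for the trained classifier, and the file name is hypothetical; the same dump and load calls apply to a fitted scikit-learn model.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: hypothetical coefficients.
model = {"weights": [0.8, -0.3], "bias": 0.1}

path = os.path.join(tempfile.mkdtemp(), "risk_model.pkl")

# Pickling: serialize the object to a binary file.
with open(path, "wb") as handle:
    pickle.dump(model, handle)

# Unpickling: rebuild an equal object from the file.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```

Unpickling executes instructions stored in the file, so only pickle files from a trusted source should be loaded by the web application.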


1.21.4. System Evaluation
The system was evaluated using local data from Clinix Hospitals, which consisted of
9 samples, of which 3 were benign and 6 were malignant. The system diagnosed
7 out of the 9 samples correctly, an accuracy of 77.8%.

Figure 4.22: System Evaluation Metrics

1.22. PROGRAM MODULES AND INTERFACES
1.22.1. Landing Page Module
The landing page is the first page a user sees when opening the application. It welcomes
users to the application, provides information about the application and gives helpful links
to navigate to the login, signup or about us page.

Figure 4.23: Landing Page of the Breast Cancer Diagnosis System

Figure 4.24: Landing Page of the Breast Cancer Diagnosis System

1.22.2. Signup Page Module
Users are required to register by creating an account before performing a
diagnosis. This page collects the user’s first name, last name, Medical ID, email
address, password and confirmation of password. The user is notified if a field does not
match the intended format, if a required field is empty, or if the password is too common or does
not match the password confirmation field. If all these requirements are met, the user is
registered and redirected to the login page.

Figure 4.25: Signup Page of the Breast Cancer Diagnosis System

1.22.3. Login Page Module


Existing users can log in to the application with their Medical ID and password.
When the user enters his/her Medical ID and password, user validation is performed by
checking whether these details exist in the database. If the details do not exist, the user is asked to
recheck the details or create an account. If the user enters the correct details, he/she is
redirected to the home page.

Figure 4.26: Login Page of the Breast Cancer Diagnosis System

1.22.4. Home Page Module


This page features a navigation bar that welcomes the user and gives the option to log out
of the system. The page also gives the users the option of diagnosing breast cancer biopsy
images or performing a breast cancer risk assessment test.

Figure 4.27: Home Page of the Breast Cancer Diagnosis System


1.22.5. Diagnosis Page Module
This is the first of the application's two core pages, and it is where diagnosis is carried
out. The user enters the patient’s details, i.e. full name and patient ID; both fields are
required and each has its own error message. The user can then upload an image for
diagnosis; only PNG and JPEG files are accepted. On clicking the view result button, the
details are saved to the database and the user is redirected to the result page. If the
result is benign, the user can either take a risk assessment test or run another
diagnosis. The user can also view a history of all the past diagnoses he/she has made,
with their corresponding results.
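
One way to restrict uploads to PNG and JPEG files is to check both the file extension and
the first few bytes of the file. The sketch below is an illustration under that
assumption; the function name is_allowed_image is hypothetical and the project may
enforce the restriction differently, for example through the upload form itself.

```python
import os

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}

# Leading bytes ("magic numbers") that identify each format.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def is_allowed_image(filename, content):
    """Accept a file only if its extension and leading bytes both look like PNG or JPEG."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False
    if extension == ".png":
        return content.startswith(PNG_SIGNATURE)
    return content.startswith(JPEG_SIGNATURE)
```

Checking the leading bytes as well as the extension rejects files that have simply been
renamed, which matters because the image is passed on to the diagnosis model.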

Figure 4.28: Perform Diagnosis Form

Figure 4.29: View Diagnosis Results Page

Figure 4.30: Diagnosis Results History

1.22.6. Risk Assessment Page
This is the second core page of the application, and it collects all the details required
to predict breast cancer risk. All fields are required, and each has an error message that
is shown when the input is missing or invalid. The user first selects or enters the
Patient ID, and then selects the patient’s menopausal status, current age group, BI-RADS
breast density result, body mass index, age at first birth, history of relatives with
breast cancer, previous breast procedure, result of last mammogram, natural or surgical
menopausal status, use of hormone replacement therapy and history of breast cancer. Based
on these selections, the user can view the patient’s risk assessment result. The user can
also view a history of all past risk assessment tests and their corresponding results.
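
To show how categorical selections such as these can feed a logistic regression model,
the sketch below one-hot encodes two fields and applies the logistic (sigmoid) function to
a weighted sum. Everything specific in it is an assumption made for illustration: the
field names, the category labels, and above all the WEIGHTS and INTERCEPT values, which
are placeholders with no clinical meaning and are not the coefficients of the trained
model.

```python
import math

# Hypothetical categories for two of the form fields; the real form has more
# fields and its own category codes.
CATEGORIES = {
    "menopausal_status": ["pre", "post"],
    "first_degree_relatives": ["0", "1", "2+"],
}

# Placeholder coefficients with no clinical meaning (assumption). A trained
# model would supply one weight per one-hot column plus an intercept.
WEIGHTS = [0.0, 0.3, 0.0, 0.5, 0.9]
INTERCEPT = -2.0


def one_hot(selections):
    """Turn the selected category of each field into a 0/1 feature vector."""
    vector = []
    for field, options in CATEGORIES.items():
        if selections[field] not in options:
            raise ValueError(f"Unknown value for {field}: {selections[field]!r}")
        vector.extend(1.0 if selections[field] == option else 0.0
                      for option in options)
    return vector


def predict_risk(selections):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    features = one_hot(selections)
    z = INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))
```

The output is a probability between 0 and 1; a deployed system would compare it with a
chosen threshold to decide which risk category to display.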

Figure 4.31: Risk Assessment Result Form

Figure 4.32: Risk Assessment Form

Figure 4.33: Risk Assessment Result

Figure 4.34: Risk Assessment History

CHAPTER FIVE

SUMMARY, RECOMMENDATION, AND CONCLUSION

1.23. INTRODUCTION
This chapter gives an overview of the project and outlines future work that could close
the gaps that remain in it.

1.24. SUMMARY
This project put forward machine learning techniques as a means of providing a second
opinion to pathologists and, eventually, of standing in for pathologists in environments
where none are available. In the course of the study, the MobileNet CNN architecture
classified malignant and benign tissues with an accuracy of 94.2%, and the Logistic
Regression classifier predicted breast cancer risk with an accuracy of 99%. The study also
found that a history of breast cancer, breast density, body mass index, age at first birth
and the number of relatives with breast cancer were the most important risk factors for
developing breast cancer. Given the high rate at which breast cancer is spreading, it is
important to consider these factors and to treat women who fall into these categories as
high priority. The models were then embedded into a Django interface so that pathologists
can enter a patient’s histopathological image from a microscope and risk factor data from
the patient’s case file, helping to ensure that proper diagnosis and prevention steps are
taken.

1.25. RECOMMENDATION
Although this project achieved high accuracy in diagnosing and predicting breast cancer,
some aspects need to be improved because of the limitations and challenges faced while
developing the system. Future work on this project includes:
• Retraining the diagnosis model on images annotated by a pathologist so that the
system can better distinguish malignant from benign tissue cells.
• Retraining the risk assessment model with more cancer-positive samples in order to
improve its precision and recall on the cancer-positive class.
• Analyzing the indigenous histopathological image dataset accumulated by the system
in order to detect outliers or anomalies, and retraining the model on that dataset.
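
The second recommendation targets precision and recall rather than accuracy. The short
sketch below illustrates why: when cancer-positive cases are rare, a model can score a
very high accuracy while still missing many of them. The confusion-matrix counts are
made up for the example and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up confusion-matrix counts for an imbalanced test set:
# 1,000 patients, of whom only 10 are cancer positive.
accuracy, precision, recall = classification_metrics(tp=4, fp=2, fn=6, tn=988)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these illustrative counts the accuracy is 0.992, yet recall on the positive class is
only 0.400, meaning that six of the ten positive patients would be missed. Adding more
cancer-positive samples to the training data is one way to address this.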

1.26. CONCLUSION
Early diagnosis of breast cancer is very important because it improves patient survival
and prognosis, and regular risk assessment tests are necessary to support prevention. This
project has developed a user-friendly system for diagnosing breast cancer and predicting
breast cancer risk. It has also created a database that can store subsequent indigenous
data for further research. Proper diagnosis of breast cancer is a pressing issue in
Nigeria, where the disease is widespread and the country lacks the facilities and
personnel needed to handle it. As a result, this system should prove valuable to the
health sector.

PAGE ii
REFERENCES
Agency Report. (2019) It will take 25 years to reduce doctors’ shortage in Nigeria –
NMA President. Premium Times. Retrieved from
https://www.premiumtimesng.com/health/health-news/367620-it-will-take-25-years-
to-reduce-doctors-shortage-in-nigeria-nma-president.html
Agodirin, O., Olatoke, S., Rahman, G., Olaogun, J., Kolawole, O., Agboola, J., …
Fatudimu, O. (2019). Impact of primary care delay on progression of breast cancer in
a black african population: A multicentered survey. Journal of Cancer Epidemiology,
2019, 1–10. https://doi.org/10.1155/2019/2407138
Ake, A. (2018). Battling Cancer Misdiagnosis. This Day. Retrieved from
https://www.thisdaylive.com/index.php/2018/08/30/battling-cancer-misdiagnosis/
Akinnuwesi, B. A., Macaulay, B. O., & Aribisala, B. S. (2020). Breast cancer risk
assessment and early diagnosis using Principal Component Analysis and support
vector machine techniques. Informatics in Medicine Unlocked, 21, 100459.
https://doi.org/10.1016/j.imu.2020.100459
Al-Quraishi, T., Abawajy, J., Chowdhury, M. U., Rajasegarar, S., & Abdalrada, A. S.
(2017). Breast cancer risk assessment prediction using an ensemble classifier. 30th
International Conference on Computer Applications in Industry and Engineering,
CAINE 2017, February 2018, 177–183.
American Cancer Society. (2020). Breast cancer risk and prevention. Cancer.Org, 1–45.
Anand, P., Kunnumakara, A. B., Sundaram, C., Harikumar, K. B., Tharakan, S. T., Lai, O.
S., Sung, B., & Aggarwal, B. B. (2008). Cancer is a preventable disease that requires
major lifestyle changes. Pharmaceutical Research, 25(9), 2097–2116.
https://doi.org/10.1007/s11095-008-9661-9
Anthonia Obokoh. (2019). How lack of data, research hinders Nigeria healthcare system.
Business Day. Retrieved from https://businessday.ng/health/article/how-lack-of-data-
research-hinders-nigeria-healthcare-system/
Barrier, T. (2003). Encyclopedia of information systems (1st ed.). Michigan: Academic
Press
Burrows, W. & Scarpelli, . Dante G. (2020). Disease. Encyclopedia Britannica. Retrieved
from https://www.britannica.com/science/disease
PAGE ii
CTCA. (2019). What’s the difference? Male breast cancer and female breast cancer.
Retrieved from https://www.cancercenter.com/community/blog/2019/07/whats-the-
difference-female-male-breast-cancer
Chioma, O. (2020). How misdiagnosis kills 70 percent of Nigeria’s cancer patients –
Expert. Vanguard Nigeria. Retrieved from
https://www.vanguardngr.com/2019/03/how-misdiagnosis-kills-70-percent-of-
nigerias-cancer-patients-expert/
Choi, J., Jung, H.-T., & Choi, W. J. (2020). Development of a breast cancer risk
Assessment model using a machine learning approach. 24(2), Retrieved from
https://www.mendeley.com/catalogue/bc719895-bf9b-34ef-a82c-d42288ffd7e8/?
utm_source=desktop&utm_medium=1.19.8&utm_campaign=open_catalog&userDo
cumentId=%7Bc081cf63-d152-3b6c-aff5-dd58b20cb0a7%7D
College of Nigerian Pathologists. (2020). Only 500 pathologists in Nigeria ’ s healthcare
sector – CNP. Retrieved from https://theeagleonline.com.ng/only-500-pathologists-
in-nigerias-healthcare-sector-cnp/
Demirel, S., & Das, R. (2018). Software requirement analysis: research challenges and
technical approaches. 6th International Symposium on Digital Forensic and Security
(ISDFS). https://doi.org/10.1109/ISDFS.2018.8355322
Emilia, J.-A. (2017). Breast cancer in sub-Saharan Africa : determinants of stage at
diagnosis and diagnostic delays in women with symptomatic breast cancer. London
School of Hygiene and Tropical Medicine, 10(17).
Felsenstein, D. (2003). Encyclopedia of information systems (1st ed.). Michigan:
Academic Press
Ferlay, J., Colombet, M., Soerjomataram, I., Mathers, C., Parkin, D. M., Piñeros, M.,
Znaor, A., & Bray, F. (2019). Estimating the global cancer incidence and mortality in
2018: GLOBOCAN sources and methods. International Journal of Cancer (Vol.
144, Issue 8, pp. 1941–1953). Wiley-Liss Inc. https://doi.org/10.1002/ijc.31937
Gogolla, M. (2009). Unified modeling language. Encyclopedia of Database Systems,
3232–3239. https://doi.org/10.1007/978-0-387-39940-9_440
Gurucharan, M. (2020). Basic CNN Architecture: Explaining 5 Layers of Convolutional
Neural Network. Retrieved from https://www.upgrad.com/blog/basic-cnn-
PAGE ii
architecture/
MacIntyre, J. (2014). Predictive Analytics Using Machine Learning. AI Journal.
Krishna, M., Neelima, M., Harshali, M., & Rao, M. V. G. (2018). Image classification
using Deep learning. International Journal of Engineering and Technology(UAE),
7(March), 614–617. https://doi.org/10.14419/ijet.v7i2.7.10892
Mayo Clinic. (2021). Breast Cancer - Symptoms and Causes. Retrieved from
https://www.mayoclinic.org/diseases-conditions/breast-cancer/symptoms-causes/syc-
20352470
Ming, C., Viassolo, V., Probst-Hensch, N., Chappuis, P. O., Dinov, I. D., & Katapodi, M.
C. (2019). Machine learning techniques for personalized breast cancer risk
prediction: comparison with the BCRAT and BOADICEA models. Breast Cancer
Research, 21(1), 75. https://doi.org/10.1186/s13058-019-1158-4
Nahid, A.-A., & Kong, Y. (2017). Involvement of machine learning for breast cancer
image classification: a survey. Computational and Mathematical Methods in
Medicine https://doi.org/10.1155/2017/3781951
Niu, Q., Teng, Y., & Chen, L. (2019). Design of gesture recognition system based on
Deep Learning. Journal of Physics: Conference Series, 1168(3).
https://doi.org/10.1088/1742-6596/1168/3/032082
Okikiola, F. M., Aigbokhan, E. ., Mustapha, A. M., Onadokun, I. O., & Akinade, O. A.
(2016). Design and implementation of a fuzzy expert system for diagnosing breast
cancer.
Olatunji, T., Sowunmi, A., Ketiku, K., & Campbell, O. (2019). Sociodemographic
correlates and management of breast cancer in Radiotherapy Department, Lagos
University Teaching Hospital: A 10-year review. Journal of Clinical Sciences, 16(4),
111. https://doi.org/10.4103/jcls.jcls_82_18
Omondiagbe, D. A., Veeramani, S., & Sidhu, A. S. (2019). Machine learning
classification techniques for breast cancer diagnosis . IOP Conf. Series: Materials
Science and Engineering . https://doi.org/10.1088/1757-899X/495/1/012033
Pedamkar, P. (2021). Machine learning architecture: process and types of machine
learning. EDUCBA. Retrieved from https://www.educba.com/machine-learning-
architecture/
PAGE ii
Phung, V. H., & Rhee, E. J. (2019). A high-accuracy model average ensemble of
convolutional neural networks for classification of cloud image patches on small
datasets. Applied Sciences (Switzerland), 9(21). https://doi.org/10.3390/APP9214500
Pilone, D., & Pitman, N. (2005). UML 2.0 in a Nutshell.
Pirouzbakht, N., & Mejía, J. (2017). Algorithm for the detection of breast cancer in digital
mammograms using deep learning. CEUR Workshop Proceedings, 2031, 46–49.
Rajendran, K., Jayabalan, M., & Thiruchelvam, V. (2020). Predicting breast cancer via
supervised machine learning methods on class imbalanced data. International
Journal of Advanced Computer Science and Applications (Vol. 11, Issue 8).
www.ijacsa.thesai.org
Silaparasetty N. (2020) An Overview of Machine Learning. In: Machine Learning
Concepts with Python and the Jupyter Notebook Environment. Apress, Berkeley,
CA. https://doi.org/10.1007/978-1-4842-5967-2_2
Spanhol, F. A., Oliveira, L. S., Petitjean, C., & Heutte, L. (2016). Breast cancer
histopathological image classification using Convolutional Neural Networks.
Proceedings of the International Joint Conference on Neural Networks, February
2018, 2560–2567. https://doi.org/10.1109/IJCNN.2016.7727519
Sreenivasa B C. (2020). Breast cancer and prostate cancer detection using classification
algorithms. International Journal of Engineering Research And, V9(06).
https://doi.org/10.17577/ijertv9is060085
Triantafyllidis, A. K., & Tsanas, A. (2019). Applications of machine learning in real-life
digital health interventions: Review of the literature. Journal of Medical Internet
Research, 21(4), 1–9. https://doi.org/10.2196/12286
University of California San Francisco. (2020). Breast Cancer Risk Factors.
W.H.O. (2008). Constitution of World Health Organization. (Vol. 44, Issue 3).
https://doi.org/10.2307/2004468
WHO. (2021a). Cancer. Retrieved from
https://www.who.int/news-room/fact-sheets/detail/noncommunicable-diseases
WHO. (2021b). Noncommunicable diseases. Retrieved from https://www.who.int/news-
room/fact-sheets/detail/noncommunicable-diseases

PAGE ii

You might also like