Mini Project Documentation All

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 37

An

Industrial Mini
Project Report
on
CONTACT TRACING USING ML
A report submitted in partial fulfilment of the requirements for the award of the
B.Tech degree
By

D. Sanjana
Reddy
(20EG105519)

N. Varshitha
(20EG105528)

M. Shemith
Reddy
(20EG105530)

Under The Guidance Of

M. Sandhya Rani

Assistant Professor, CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


ANURAG UNIVERSITY
VENKATAPUR– 500088
TELANGANA
YEAR 2023-2024

I
DECLARATION

I h ere declare that the Report entitled “CONTACT TRACING USING


ML” submitted for the award of Bachelor of technology Degree is my own work and
the Report has not formed the basis for the award of any degree , diploma, associate
ship or fellowship of similar other titles. It has not been submitted to any other
University or Institution for the award of any degree or diploma

Place: Anurag University, Hyderabad D. Sanjana Reddy (20EG105519)

N. Varshitha (20EG105528)

M. Shemith Reddy (20EG105530)

II
CERTIFICATE
This is to certify that the Report / dissertation entitled “CONTACT TRACING
USING ML” that is being submitted by Ms. Sanjana Duggireddy, Mr. Shemith Reddy
and Ms. Varshitha Naddunuri in partial fulfilment for the award of B.Tech in
Computer Science and Engineering to the Anurag University is a record of bonafide
work carried out by his/her under our guidance and supervision.

The results embodied in this Report have not been submitted to any other
university or Institute for the award of any degree or diploma

Dean Signature of The Supervisor


M. Sandhya Rani
(Assistant Professor)

III
ACKNOWLEDGMENT

We would like to express our sincere thanks and deep sense of gratitude to
project supervisor M. Sandhya Rani for his constant encouragement and inspiring
guidance without which this project could not have been completed. Her critical
reviews and constructive comments improved our grasp of the subject and led to the
fruitful completion of the work. Her patience, guidance and encouragement made
this project possible.

We would like to express my special thanks to Dr. V. Vijaya Kumar, Dean


School of Engineering, Anurag University, for their encouragement and timely
support in our B.Tech program.

We would like to acknowledge our sincere gratitude for the support extended
by Dr. G. Vishnu Murthy, Dean, Dept. of CSE, Anurag University. We also express
my deep sense of gratitude to Dr. V V S S S Balaram, Academic coordinator. Dr.
Pallam Ravi Project Coordinator and Project review committee members, whose
research expertise and commitment to the highest standards continuously motivated
me during the crucial stage of our project work.

IV
ABSTRACT

At the time of Corona, we are undergoing a dangerous situation due to wide spread of “Corona Virus”. If a
person is found to be infected with the Corona Virus, then it is critical to find the person who got contacted
with the infected person. This can be done by Manual methods or numerical methods.
Technology can be used for Contact tracing is a process that produces more efficient and accurate results than
when the procedure is done manually.
Contact Tracing is one type of process to identify infected people using machine learning. It can help to reduce
the number of infected persons and can fasten the process of treating people who have been infected with a
disease. We can save a lot of lives this way.
To find the movement of an infected person, we use GPS data set that contains time and location of a person.
The location data contains longitude and latitude coordinates.

VI
LIST OF FIGURES

Figure No. Figure Name Page No


1.1 Contact Tracing 1
1.2 Sample of data set 4
1.3 GPS Tracking 7
3.1 Use Case Diagram 9
3.2 Activity Diagram 10
3.3 Class Diagram 11
3.4 Sequence Diagram 12
3.5 System Architecture 13
4.1 Seaborn Scatterplot 16
4.2 DBSCAN Clustering 18
4.3 Data Visualization 1 20
4.4 Data Visualization 2 21
4.5 Data Visualization 3 22
4.6 Data Visualization 4 24

VII
INDEX
S.No. CONTENT Page No.

1. Introduction 1
1.1 Rationale 2
1.2 Goal 2
1.3 Objective 2
1.4 Methodology 2
1.5 Roles and Responsibilities 4
1.6 Contribution of Project 5
1.6.1 Market Potential 6
6
1.6.2 Innovativeness 6
1.6.3 Usefulness 7
1.7 Usefulness

2. Requirement Engineering 7
2.1 Functional Requirement 7
2.1.1 Interface Requirement 7
2.2 Non Functional Requirement 8

3. Analysis and Design 8


3.1 Use case Diagrams 8
3.2 Activity Diagrams 9
3.3 Class Diagram 10
3.4 Sequence Diagram 11
3.5 System Architecture 12
4. Construction 13
4.1 Implementation 13
4.2 Implementation Details 14
4.2.1 DBSCAN Algorithm 15
4.2.2 Data Visualization 16
4.3 Software Details 17
4.4 Hardware Details 18
4.5 Testing 19
4.5.1 White Box Testing 26
4.5.2 Black Box Testing 26

5. Conclusion and Future Scope 29


5.1 Conclusion 29
5.2 Future Scope 29
5.3 Expected Outcome 29

6. References 30

VIII
1. INTRODUCTION

There are lot of infectious viruses which can be spread from one person to another person easily.
Coronavirus is one of the viruses among them. If a person tests positive for Covid-19, it's critical to track down
the other persons who were exposed to the virus through the sick person.
We track the activities of patients who have received treatment in the last 14 days to detect contaminated
people. Contact tracking is the term used to describe this process. This can be accomplished using either
manual or numerical methods.
In the process of contact tracing, technology can be employed, and it is more efficient and accurate than
manual techniques.
The Technology that is used in the process was Clustering, which is a subclass of Machine Learning. It is used
to stop the spread of viruses and even save more lives. This can be done by collecting data sets of different
people at different locations with their timestamps.
Our data set contains 4columns such as:
1: Name of a person
2: Timestamp
3: Latitude.
4: Longitude
Applying DBSCAN Algorithm results in the formation of clusters. We analyze the data using the help of
graph. The piece of python code which returns the contacted persons.
Here, we use Jupyter Notebook for analyzing and running the code. The installation of Jupyter Notebook was
simple and easily downloaded from the Internet.
Jupyter comes with the anaconda package. We can easily install different libraries in the Anaconda command
prompt, and this makes our project simple

Figure 1.1 Contact Tracing

1
1.1 Rationale

The population of a country increases day by day. This results in the increase of transmission of
viruses.
There are many viruses which are contagious and more dangerous such as HIV and measles.
We should implement new Technologies to solve modern problems and similarly in
the field of health.
The technology Machine Learning, which is added to the existing methods, results in more success.

1.2 Goal

Our Main goal is to:

Predict the infected persons who are in closely contact with the infected person.

1.3 Objective

The target of our efforts is to:

● Predicting the infected persons.

● Analyzing the Data.

● Understanding the Graphs/Clusters.

● Testing for the different names(input).

1.4 Methodology

Machine learning

Machine learning is the study of algorithms that improve over time as they gain experience from
data. It has become a common technique in practically every operation that demands information
extraction from massive data sets over the previous few decades. We are surrounded by machine
learning technology: search engines learn how to deliver the best results to us (while placing pro
table advertisements), and anti-spam software is always improving. Our email messages are filtered

3
by a machine-learning algorithm, and credit card transactions are safeguarded by software that learns
how to detect fraud. Face detection in digital cameras is assisted by machine learning, and intelligent
personal assistant apps on smartphones learn to identify spoken instructions. Accident-prevention
systems in vehicles are focused on machine learning algorithms and artificial intelligence.
In addition to bioinformatics, daily uses medicine, and astronomy, machine learning
is widely employed in scientific applications. In contrast to more traditional uses of computers, all
these applications have one thing in common, A human programmer cannot provide an explicit,
precise detailed specification of how such tasks should be accomplished in these instances due to the
complex of the patterns that need to be recognized. Many of our skills are obtained through learning
from our experiences, as evidenced by intelligent human beings (rather than following instructions
given to us). The goal of machine learning tools is to provide programs/algorithms with the ability to
learn and adapt.

Time (hour, day, month, year) and location (latitudes and longitudes). are some inputs to our
algorithms.

The type of crime that is most likely to have occurred is the output. We use a variety of classification
algorithms on Our Dataset, including DBSCAN algorithm.

The data set we're using was gathered from the internet.

Our Dataset

The dataset we're using was scraped from a website dedicated to data science that is open to the public. The
dataset contains nearly 100 entries with 4columns.
Features of this dataset
● Names of persons
● Timestamp
● Latitudes
● Longitudes

5
Figure 1.2 Sample of Dataset

1.5 Role and Responsibilities

6
Name Responsibility

D. Sanjana Reddy o Interface Installation


o Coding
o Documentation

N. Varshitha o Data Analysis


o Error Checking
o Documentation

M. Shemith Reddy o Data Analysis


o Installation
o Error checking

7
1.6 Contribution of Project

1.6.1 Market potential:

When a virus outbreak arises, public health ministry's utilize machine learning to track infected or
contacted persons to stop the spread of the infection.

Individuals who may have been exposed to the virus are provided with the information.

And directions on how to recognize their risk and what signs to look out for, as well as methods.

If they start to experience any symptoms, to protect others from infection.

In the situation that they become ill because of their interaction with the infected person, they encourage more
people to stay at home and maintain a social distance from others for two weeks. They should keep an eye on
themselves by taking their temperature twice a day and looking for any signs of the virus.
It’s an effective way to stop the virus spread. By this process we collect the data of
The people. Machine learning algorithms like DBSCAN algorithm helps us to get the
clusters and analyze the data with graphs

1.6.2 Innovativeness

The idea behind this project is to trace the infected people by using the GPS locations of the persons
who got infected. We carry forward to the next step in tracing. We use the DBSCAN clustering
algorithm to find the persons who may have gotten infected. Machine learning algorithms provide the
best and most accurate output for tracing.

8
Figure 1.3 GPS Tracking

Based on the GPS locations of the persons who are present at same location and at the same time they will
be picked for the verification of infection.

1.7 Usefulness

Contact Tracing is an integral part in defeating the pandemic. Contact tracing ensures that the virus infection
is limited and stops the pandemic to the next stage i.e. It stops the community transmission.

One of the major uses of the tracing is to alert the people who may have been exposed to someone who had
infection and prevent them from virus and spreading it to others by isolating them.

2. REQUIREMENT ENGINEERING

2.1 Functional Requirement


The functional requirements define the application's core functionality.

2.1.1 Interface Requirement


• A Dataset can be used as input for processing further operations.
• Jupyter notebook to be installed.

• Numpy, seaborn, sklearn packages installed.

• Jupyter command box is used to write and run code.

• Displays the output in the same jupyter interface.

9
2.2 Non-Functional Requirement

Nonfunctional needs are system requirements that have nothing to do with the specific functionality provided
by the system. They could be linked to emergent features like dependability, usability, and so on...

● To deliver the highest degree of accuracy.

● Provide a visual representation of the data.

● the ease with it that can be utilized.

● Accessibility.

● Stability.

● Readability.

3. ANALYSIS AND DESIGN

3.1 Use case diagram

Use case diagrams depict the system's entire scenario. A scenario is nothing more than a series of steps that
describe a user's interaction with a system.
As a result, a use case is a collection of scenarios linked by a common aim. The use case diagrams are created
to show the system's functionalities.

10
Figure 3.1 Use Case Diagram

3.2 Activity Diagram

The activity diagram is a graphical representation of the interaction flow inside a certain scenario.
It's comparable to a flowchart in that it depicts the many operations that can be carried out in the
system.

11
Figure 3.2 Activity Diagram

3.3 Class diagram

A class diagram in the Unified Modeling Language (UML) is a form of static structural diagram in software
engineering that depicts the structure of a system by displaying the system's classes, properties, operations (or
methods), and relationships among objects.

12
Figure 3.3 Class Diagram

3.4 Sequence diagram

The interaction of the thing with the other object is depicted in the sequence diagram. There is a representation
of a sequence of occurrences.
It is a time-based perspective on the relationships between elements in order to fulfil the framework's behavior
goal.

13
Figure 3.4 Sequence Diagram

3.5 System architecture

The process of identifying the subsystems that make up the system, as well as the framework for
subsystem control and communication, is known as system architecture design. The architectural
design's purpose is to define the software system's general structure.

14
Figure 3.5 System Architecture

4. CONSTRUCTION

4.1 Implementation
The project is implemented using the Python programming language. To be more specific,
Anaconda is being utilized for machine learning purposes. Anaconda is a Python distribution that
comes in a variety of styles. Anaconda is a new Python distribution. Continuum Analytics was its
previous name. More than 100 new packages have been added to Anaconda. Scientific computing,
data science, statistical analysis, and machine learning are all areas where Anaconda is used.

We found Anaconda to be more user-friendly when it came to Python technology. Because it aids in
the treatment of the following issues:
• Python installation on a variety of platforms.

• Differentiating across distinct environments.

• Dealing with the consequences of not having the appropriate privileges.

• Getting certain packages and libraries up and running.

This information was gleaned from the datascience.com website. Implementation of the idea
started to stop the virus outbreak or reduce the infected persons. The data was modified a
little according to the locality in terms of names and location. The data set format contains id, User,
Time stamp, Longitude, Latitude. From these, username will be given as the input then the model
gives the names of the people who have the more probability of getting infected by the input person.

After the data set is given to the model it will be analyzed using the scatter plot with Latitude on X-

15
axis and the Longitude on Y-axis. Atleast 10 or more entries will be taken for each user at different
locations along with Timestamp so that it forms more clusters.

4.2 Implementation Details

PANDAS :--

It is a python library used for analyzing the dataset. The Functions of pandas are analyzing, cleaning, exploring
and manipulating the given dataset.

It allows us to analyze and clean big data and make the conclusion based on theories.

It can clean messy data sets, and make them relevant.

The installation of pandas can be installed easily by typing the command as follows.

C:\Users\Name>pip install pandas

NUMPY :--

NumPy is a Python module that allows you to interact with arrays. Numerical Python is what it stands for.
It has matrices and transform functions in the domain of linear algebra.

NumPy aim is to provide an array which is more faster than traditional Python lists

It can be installed by using C:\Users\Name>pip install numpy.

SEABORN :--

Seaborn is a fantastic Python visualisation toolkit for statistical graphic plots. To make statistical charts more
appealing, it provides default styles and colour palettes. It is based on the matplotlib software and is tightly
connected with pandas data structures.
The goal of Seaborn is to make visualisation a core aspect of data exploration and understanding.
it provides dataset-oriented Application interfaces (APIs), allowing us to move between several visual
representations for the same variables for improved dataset understanding.
Plots are mostly used to represent the relationship between two variables.

16
There are different types of plots that are categorized as follows:
• Relational plots:- This plot is used to figure out how the two variables are related.
• Categorical plots:- This plot is about categorical variables and how to visualize them.
• Distribution plots:- These graphs are used to look at univariate and bivariate distributions.
• Regression plots: In seaborn, regression plots are used to offer visual guidance that helps to
emphasize the trends in a dataset during data analysis.

• Matrix plots: Matrix plots are an array containing scatterplots that are used to analyze data.
• Multi-plot grids: Drawing numerous instances with same plot on different subsets of the dataset to make it
comprehensible is a beneficial strategy.

For installing Seaborn, we type the command as follows:


In python: pip install seaborn
In Conda: conda install seaborn

It can be used to represent the random distributions of the parameters.

The following graph is the basic example for the seaborn scatter plot.

17
Figure 4.1 Seaborn Scatterplot

4.2.1 Clustering

Unlabeled clustered data, which may be done with the sklearn. cluster module.
Every clustering technique has two versions: a class which performs the fit method to acquire the clusters upon
on train dataset, and a function that produces an array of integer tags corresponding to various clusters when
provided train data. They are different methods in clustering. They are:
• K-means

18
• Affinity Propagation
• OPTICS
• BIRCH
• Mean Shift

• Spectral clustering
• DBSCAN
• Hierarchical clustering
Algorithms and procedures are employed to ensure appropriate implementation and operation. The
algorithm utilized is as follows:

DBSCAN algorithm

DBSCAN is a density-based spatial clustering of applications with noise. This can detect arbitrary-shaped
clusters as well as clusters with noise (i.e. outliers).The DBSCAN algorithm works on the principle that a point
belongs to a cluster whether it is close to many other points from such a cluster.

DBSCAN has two important variables:

• eps: The distance between neighborhoods that defines them. If the distance between two points is
below or like eps, they are termed Neighbours.

• minPts: The minimal number of data points required to define a cluster is minPts.

Points are classified as core, boundary, or outlier two of the following two parameters:

● Core point: If there are minimum minPts number of points (along with the point itself) in its
neighborhood region with radius eps, it is a core point.

● Border point: A point would be a border point if it can be reached from a core point and the
number of points in its immediate vicinity is fewer than minPts.

Outlier: If a point isn't a core point and can't be reached from any other core points, it's called an outlier.

19
Figure 4.2 DBSCAN Clustering

The value of minPts is 4 in this example. Because there are at least four points of radius eps within its
surrounding(neighborhood) region, red points are considered core points. The circles there in the figure depict
this area. Because they may be reached from a core point and have fewer than four points in their
neighborhood, the yellow points are considered very border points. The term "reachable" refers to points that
are within a certain radius of a central point. Within the neighborhood of points B and C, there are two points
(along with the point including itself) ( the surrounding area with which a radius of eps). Finally, since N is not
a core point and cannot be visited by one, it is an outlier.
How the algorithm works:

● Initially, minPts and eps are determined.

● An initial point will be chosen at random from its surrounding area using radius eps. If there
should be at least minPts number of points in the neighborhood, that point is designated as a core
point, and a cluster formation begins with it. If this is not the case, the point is labelled as noise.
When the development of clusters (let's call it cluster A) begins, every point in the vicinity of the
starting point becomes a member of cluster A. If the newly added points are already core points,
the points in their immediate vicinity are likewise added to the cluster A.

● A point that has been labelled as noise can be revisited and become part of a cluster.

20
● The next step in this process is to select another point at random among those that were not
visited there in previous steps. Then follow the same instructions as before

● When all of the points have been covered, the process is finished.

● A distance measuring approach, similar to that used in the k-means algorithm, is used to determine
the distance between each pair of points. For this, the Euclidean distance approach is often
utilized.

● The DBSCAN method can obtain high density areas and differentiate them over low density
regions utilizing these steps.

● A cluster is made up of neighbors (i.e., points that may be reached from one another) and the
boundary points of these neighbors. The presence of at least one core point is required for the
formation of a cluster. We will have a cluster of only a single core point and all its boundary
points, which is extremely unusual.

DBSCAN's Advantages and Disadvantages:

DBSCAN’s advantages:

● It is not necessary to mention the number of clusters that have formed.

● It works well with clusters of arbitrary shapes.

● DBSCAN is resistant to outliers and will also be able to identify them.

DBSCAN’s disadvantages:

● In some cases, calculating a suitable neighborhood(eps) distance is difficult and may need domain
expertise.

DBSCAN is not well suited for determining clusters when in-cluster densities are substantially varied. A
combination of eps-minPts variables can be used to specify the clusters' properties. We can't generalize
successfully to clusters with significantly varied densities because we only provide the algorithm one eps-
minPts combination.

4.2.2 Data Visualization


Clusters visualization using DBSCAN algorithm

21
Figure 4.3 Data Visualization 1
In the above image three different colors represents three different clusters and the blue dots
represents the outliers

clusters visualization for our project is as shown below:

22
Figure 4.4 Data Visualization 2

As the eps for this cluster is very less i.e., Six feet distance between each person should be
maintained so the number of clusters formed will be more.

23
Figure 4.5 Data Visualization 3

#visualizing the data in boxplots with id (x-axis) and latitude (y-axis)


sns.boxplot(x= 'id', y= 'latitude', data = df, palette = 'coolwarm')
plt.tight_layout()

24
Figure 4.6 Data Visualization 4

Code Implementation

1. Importing and installing libraries:

2. To extract the dataset by the following code:

df=pnd.read_json (r‘C:\Users\dell\Desktop\mockdata.json’)
df.head()
26
s
27
4.Main code:

4. Formation of clusters:

labels = model.labels_
fig = plt.figure (figsize = (12,10))
sns.scatterplot (df ['latitude'], df ['longitude'], hue = ['cluster-{}'.format(x) for x in labels])
plt.legend(bbox_to_anchor = [1, 1])
plt.show()

28
#will return the infected name back based on the inputed id
get_infected_names('Alice')

Output:
['Judy', 'Carol', 'David', 'Frank', 'Erin', 'Bob', 'Grace', 'Ivan',

'Heidi']

4.3 Software Details

• Python (v3.6.5)
• Colab Notebook

29
4.4 Hardware Details

• Operating system: Windows 7 or newer, 64-bit macOS 10.9+, or Linux.

• System architecture: 64-bit x86, 32-bit x86 with Windows or Linux.

• CPU: Intel Core 2 Quad CPU Q6600 @ 2.40GHz or greater.

• RAM: 4 GB or greater.

4.5 Testing

Software testing is an investigation conducted to provide information to partners regarding the


nature of the goods item or administration being tested. Software testing can also provide a neutral,
objective viewpoint on the product, allowing the company to see and understand the risks associated
with programming implementation. The method for executing a programme or application with the
goal of identifying programming errors (mistakes or other deformities) and ensuring that only the
product item is fit for use is known as testing strategies.

Programming testing entails the implementation of a product portion or framework segment in order
to evaluate at least one curiosity property. These properties show to what extent the section or
framework underneath test:

● meets the requirements that guided its design and development,

● responds effectively to a variety of data sources,

● performs its functions in a timely manner,

● it is sufficiently usable,

● can be installed and used in its intended environment, and

● achieves the overall result that its partners desire.

Because the number of possible tests for even simple programming components is practically endless,
all product testing employs some process to select tests that are feasible given the time and

30
resources available.As a result, programming testing typically (but not always) involves running a
programme or application in the hopes of finding programming errors (mistakes or different
imperfections). Testing is an iterative process because when one fault is corrected, it can reveal
other, more serious bugs or even create new ones.

4.5.1 White box testing

White-box testing is a programmer testing approach that examines an application's internal structures or
activities rather than its utility (for example discovery testing). White-box testing employs an insider's
perspective of the framework, similar to programming skills, to conduct configuration tests. The analyzer
selects contributions to test different paths through the code and determine the standard yield.

4.5.2 Black box testing

This is intended to reveal errors in requirement specification without regard for the project's
internal workings. This testing concentrates on the project's information domain, with the test
case developed by splitting the input and output domains of programming – A method that
ensures complete test coverage.

5. CONCLUSION AND FUTURE SCOPE

5.1 Conclusion

Contact tracing is one of the techniques we can utilize technology to save lives and provide people the
treatment they need as soon as possible. Government and medical personnel frequently have accessibility to
the GPS coordinates of some patients. The approach we went through in this study is almost the same as the
method officials go through to collect probable diseases. Fortunately, packages like sklearn allow us to apply
existing models on our datasets to generate results.
The creation of a contact tracking system is among the most crucial steps to take to restore the global economy

31
as safely as feasible. Contact tracing is our sole option for isolating the viruses when it occurs when social
distancing is loosened, and this is especially useful in the case of coronavirus.

5.2 Future Scope

The goal of our project is to reduce manual work by adding new technologies like machine learning/artificial
intelligence. This is the starting stage of project applications. As we move further one can explore and add a
greater number of technologies to the manual projects which reduce human stress leading to the development.
Contact Tracing is one of such examples which adds machine learning for finding the contacts easily and this
helps the humans reduce the work.

Predicting Future Infectious Diseases:

Like Corona Virus, more contagious diseases can be easily identified by new technologies. Our model is very
helpful in the case of finding the contacts easily.
This reduces human work and can be updated by adding more technology for further projects. Also, it plays an
important role in the health care system. Not only in the healthcare system, but the technologies can also be
added in any field, where human work is more.

5.3 Expected Outcome

This contact tracing method takes one of the usernames from the data. By analyzing it using the GPS
location and the timestamps which we provide in the dataset, it gives the names of those people who may
get infected, based on the distance they maintained with the infected person. This type of data analysis would
have been technologically unfeasible just a few years ago, but current advances in machine learning appear to
be up to the challenge.
The use of machine learning methods such as the DBSCAN algorithm makes data analysis much easier.

32
References

[1] https://machinelearningmastery.com/clustering-algorithms-with-python

[2] https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learming.html

[3] https://thecleverprogrammer.com/2021/02/03/dbscan-clustering-in-machine-learning

33

You might also like