PROJECT - 2

A PROJECT REPORT

DATA ANALYSIS AND VISUALISATION OF COVID-19

1. BABULAL KUMAR
2. PRADEEP KUMAR

in partial fulfilment for the award of the degree of
BACHELOR OF TECHNOLOGY in ELECTRONIC AND COMMUNICATION ENGINEERING

CENTURION UNIVERSITY
DEPARTMENT OF ELECTRONIC AND COMMUNICATION ENGINEERING
SCHOOL OF ENGINEERING AND TECHNOLOGY
PARALAKHEMUNDI CAMPUS

CERTIFICATE

Certified that this project report "DATA ANALYSIS AND VISUALISATION OF COVID-19" is the bonafide work of BABULAL KUMAR and PRADEEP KUMAR, who carried out the project work under my supervision. This is to further certify, to the best of my knowledge, that this project has not been carried out earlier in this institute and the university.

SIGNATURE
JYOTI RANJAN SHAOO
Dept. of ELECTRONIC AND COMMUNICATION Engineering

Certified that the above-mentioned project has been duly carried out as per the norms of the college and statutes of the university.

SIGNATURE
(Prof. Prabhat Kumar Patnaik)
HEAD OF THE DEPARTMENT
ELECTRONIC AND COMMUNICATION ENGINEERING
DEPARTMENT SEAL

DECLARATION

I hereby declare that the project entitled "DATA ANALYSIS AND VISUALISATION OF COVID-19", submitted for the "Python project" of the 1st semester B. Tech in ELECTRONIC AND COMMUNICATION Engineering, is my original work and that the project has not formed the basis for the award of any Degree / B.Tech or any other similar titles in any other University / Institute.

BABULAL KUMAR (230101130066)
PRADEEP KUMAR (230101130069)

Place: Centurion University of Technology and Management, Paralakhemundi, Odisha.
Date: 25/12/2023

TABLE OF CONTENTS

CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
LIST OF TABLE
ABSTRACT
LIST OF FIGURES
LIST OF SYMBOLS / NOTATION
CHAPTER -1 ABSTRACT
CHAPTER -2 INTRODUCTION
CHAPTER -3 LITERATURE SURVEY
CHAPTER -4 IMPLEMENTATION
    4.1 METHODOLOGY
    4.1.2 Step 1: Cleaning the Dataset
    4.1.3 Step 2: Data Visualization
    4.1.4 Step 3: Computing Accuracy
CHAPTER -5 SYSTEM REQUIREMENTS
    5.1. GENERAL DESCRIPTION
    5.1.2. HARDWARE REQUIREMENTS
    5.1.3. SOFTWARE REQUIREMENTS
    5.1.4. USER REQUIREMENTS
6. CHAPTER -6 RESULTS
    6.1 DASHBOARD
7. CHAPTER -11 CONCLUSION
8. CHAPTER -13 REFERENCE

ABSTRACT

The Covid-19 pandemic has shaken the world completely. No one knew what was coming and everyone was running helter-skelter. Governments were paralyzed and the infrastructure required to deal with the problem was completely absent. The genome sequence was out, but what the disease entailed and what it would lead to was anyone's guess. Even today, as we write, there are multiple dimensions of it that lie unexplored and need deep exploration.

Our project seeks to uncover this mystery through the application of data science. We seek to use data science to help the authorities and to give the medical field the insight that data can provide, so that the pandemic can be dealt with better. Data science applies algorithms and machine learning to train models that find patterns. Patterns reveal the common issues and the common symptoms, and everything that is common comes out in a visual representation. It is these representations that make complex things easy and digestible for people from non-technical backgrounds.

Using data science in such a pandemic will lead to greater insights into the data we are working on. A huge dataset of people suffering from the coronavirus can give us better ways of fighting the pandemic. In our project data science is applied only to the coronavirus, but its applications are wide-ranging and it can be applied across other diseases to diagnose them better. In fact, data science is a new method of diagnostics and can lead to even better cures for diseases. It is this frontier that we seek to explore in our project.

1.1 INTRODUCTION

Covid-19 cases are increasing day by day all over the world, millions of people are dying and the economy is in free fall.
It has been spreading like water in the open ocean and there seems to be no stopping it. Thankfully, since the inception of this epidemic many countries have properly maintained databases of each and every patient and their health history. Today we have advanced computational infrastructure and data science algorithms through which we can analyse these datasets and gain insightful information to help society. People have proposed many interesting models and trend prediction methods. This project will help us recognise the insights that can be gained by applying data science algorithms to the data. These insights will give an idea of how the number of Covid cases changes and of the possibility of a person being diagnosed positive on the basis of their symptoms.

1.2 LITERATURE SURVEY

In the research paper [1], the authors R. Wang, G. Hu, C. Jiang, H. Lu and Y. Zhang compare pattern prediction using three methods and compare their graphs with each other. These models are the conventional logistic regression model, the Particle Swarm Optimization SIR model and the least-squares SIR model. The resulting chart plots the number of patients with novel coronavirus pneumonia against the date. From the three patterns we see that the data follow a curve.

The public figures of daily updated confirmed cases of Covid-19 from Johns Hopkins University were analysed in the study [2] by V. Z. Marmarelis et al. RM, as described by the Riccati equation, is the main modelling element of the method. Further, by applying the equation, five different parameters are found, together with their dependence on the number of cases increasing day by day.

In [3], Gerry Wolfe, Ashraf Elnashar, Will Schreiber and Izzat Alsmadi analysed knowledge on coronavirus disease and sustainable therapy using research articles, guided by literature clustering of COVID-19 datasets from Kaggle. The data were further divided into four categories: (1) mobility and social distancing; (2) health and COVID; (3) economic impact; and (4) vulnerable population, and were utilised in a second dataset from MTI. The documents were analysed and the text was processed to produce tokens for clustering, and the K-Median method was used to label the data to help extract and analyse the categorised data.

According to Tuli, Shrestha et al. [4], Machine Learning (ML) and Cloud Computing can be used to track the epidemic extremely efficiently, anticipate an outbreak of the illness, and create appropriate policies to regulate its expansion. Then, given the array, face extraction and collection is done. They proposed a Machine Learning model that can be run continuously on Cloud Data Centers (CDCs) for accurate spread prediction and proactive development of a strategic response by the government and citizens. The dataset used in this case study is Our World in Data by Hannah Ritchie. They also used a cloud framework and Azure instances for real-time analysis of the data.

In the research paper [5], Francisco Nauber, Bernardo Gois et al. emphasise the rising popularity of epidemic behaviour prediction research due to its capacity to anticipate the natural course of viruses. The study presents several predictor approaches based on machine learning, logistic regression, filters and epidemiological models in order to explain COVID-19's behaviour.

In the research paper [6], the authors Yazeed Zoabi, Shira Deri-Rozov and Noam Shomron acknowledge that accurate SARS-CoV-2 screening allows for fast and efficient COVID-19 diagnosis and reduces the strain on health care systems.
Prediction models using many characteristics have been created to assess the likelihood of infection. Their model achieved 0.90 auROC (area under the receiver operating characteristic curve) on the forward-looking test set.

In the research paper [7], the authors Enis Karaarslan and Dogan Aydin note that the COVID-19 outbreak showed that the world was unprepared for the virus to spread so rapidly. One crucial factor in mitigating the detrimental impacts of an epidemic or pandemic is the effective use of information technology. They suggest an epidemic management system (EMS), which relies on the unfettered and timely flow of information between states and organisations. They use an MPISA paradigm, which allows different platforms to be integrated and addresses issues of scalability and interoperability.

The paper [8] describes the use of a new compartment-based epidemiological model for estimating the propagation of the coronavirus COVID-19, namely SEIAR (Susceptible, Exposed, Asymptomatic, Infectious, Recovered). The model is fitted through the heuristic approach of differential evolution. In this way the day(s) when the number of cases reaches its maximum, the associated value and the future evolution of the spread may be estimated approximately for different situations.

In [9], Ayyoubzadeh S et al. used computerised data mining technologies to gain better insight into the outbreak of COVID-19 in each country and globally, for the management of the health crisis. The data were collected from the Google Trends website. Linear regression and long short-term memory (LSTM) models were used to estimate the number of positive COVID-19 cases.

The study [10] by Amir-Sardar Kwekha Rashid, Heamn N Abduljabbar and Bilal Alhayani shows that in COVID-19 research, hypotheses may be proved to be deterministic, transforming into clear findings and predictions.
The outcomes of supervised learning algorithms were better than those of unsupervised learning algorithms, with a reported figure of 92.9%. Artificial intelligence and deep learning can be seen as an aid to standard diagnostic procedures such as IgM, IgG, chest X-ray, CT scans and RT-PCR. The CNN algorithms selected for this study are MobileNet, DenseNet, Xception, ResNet, InceptionV3, InceptionResNetV2, VGGNet and NASNet.

1.3 IMPLEMENTATION

Methodology

We use Machine Learning to make predictions on the basis of data taken from a government website [11]. We clean the data using Excel cleaning methods and then use the algorithm with the highest accuracy to predict whether a person is COVID -ve or +ve on the basis of 5 major symptoms. The process can be explained in the following points:

1. First, take the dataset, remove redundant data and organise the data according to our needs.
2. Second, load the dataset in the Jupyter Notebook and apply data visualization techniques to understand the data better.
3. Third, calculate the accuracy of various algorithms and plot a graph of their accuracies.
4. Finally, using the accuracy graph, take the algorithm with the best accuracy (in this case the Decision Tree Classifier) to predict whether the person is -ve or +ve on the basis of symptoms.

3.2 Description of the Process

We are building our own COVID Prediction System using Jupyter Notebook. The process can be described in the following steps:

Step 1: Cleaning the dataset

The very first step in our project is to get a reliable and authentic dataset for the prediction and analysis. Our search for a dataset ended at [11], a government website that provides the dataset for free use and is authentic. The next thing we did was to clean the dataset and remove unwanted columns from it for faster computation.
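The cleaning in Step 1 can be sketched in pandas as follows. This is only an illustration under stated assumptions: it uses pandas rather than the Excel cleaning methods mentioned above, the helper name clean_dataset, the choice of columns to keep and the small inline sample (including the made-up "Testland" row) are hypothetical, and the column names are borrowed from the country_wise_latest.csv dataset shown later in the report.

```python
import pandas as pd


def clean_dataset(df, keep_columns):
    """Return a cleaned copy of df restricted to keep_columns (hypothetical helper)."""
    cleaned = df.copy()
    # Normalise header whitespace so column lookups are predictable.
    cleaned.columns = [str(c).strip() for c in cleaned.columns]
    # Remove redundant data: exact duplicate rows.
    cleaned = cleaned.drop_duplicates()
    # Remove unwanted columns by keeping only the ones we need.
    cleaned = cleaned[keep_columns].copy()
    # Coerce everything except the country name to numbers; bad cells become NaN.
    numeric_cols = [c for c in keep_columns if c != "Country/Region"]
    cleaned[numeric_cols] = cleaned[numeric_cols].apply(pd.to_numeric, errors="coerce")
    # Discard rows that are missing any of the kept values.
    cleaned = cleaned.dropna(subset=keep_columns)
    return cleaned.reset_index(drop=True)


# Small inline sample (assumed data): one duplicate row, one unparseable cell,
# one stray space in a header and one column we do not need.
raw = pd.DataFrame({
    "Country/Region ": ["Afghanistan", "Afghanistan", "Andorra", "Testland"],
    "Confirmed": [36263, 36263, 907, "n/a"],
    "Deaths": [1269, 1269, 52, 3],
    "WHO Region": ["Eastern Mediterranean", "Eastern Mediterranean", "Europe", "Europe"],
})

clean = clean_dataset(raw, ["Country/Region", "Confirmed", "Deaths"])
print(clean)
```

Dropping rows with missing values is the simplest choice for a sketch; depending on the dataset, filling them in may be more appropriate.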

Step 2: Data Visualization

Here we take the dataset and check its consistency by inspecting randomly chosen values. Then we visualise the data for better understanding using various plots, graphs and heatmaps. All these graphs and plots give us an insight into huge datasets easily.

Step 3: Computing Accuracy

In this step we compute the accuracy of four algorithms: Logistic Regression, KNN, Random Forest Classifier and Decision Tree. We selected these algorithms on the basis of their qualities in regression and classification.

5. SYSTEM REQUIREMENTS

General Description

Data Analytics on Covid-19, as the name suggests, is an analysis of data such as the people infected, their age, the sources from which they were infected, their history of any previous chronic diseases, and so on. We wish to obtain almost all the meaningful insights that we can get using various data science and machine learning techniques, and by looking at those insights we can predict future trends or other crucial information. The project requires an active internet connection because it uses various Machine Learning models depending on how we want to train our data. The tools and libraries we intend to use were chosen so that we can get the "best of the waste" and provide some service to society. Hence we look forward to achieving what we have intended and hope the analysis turns out to be a success.

HARDWARE REQUIREMENTS

1. High Resolution Camera
2. RAM: 4 GB
3. Processor: Intel i5 or higher
4. 2 GB Graphics Card

SOFTWARE REQUIREMENTS

1. Windows 10 or higher
2. Text Editor
3. Python 3.9.0
4. OpenCV
5. Jupyter Notebook

Non-functional and functional requirements

The system functional requirements define the operations and services to be provided by the system:

1. Using Jupyter Notebook, the csv file is manipulated to get meaningful insights.
2. OpenRefine for data scrubbing.
3. Numpy, Pandas and Matplotlib for data exploration, inspection and visualisation.
4. For modelling the data we need a decent knowledge of the scikit-learn library of Python.
5. Training the model on the dataset.
6. Matplotlib, ggplot, Seaborn, Tableau or d3js for interpreting the data.

Non-functional requirements are any features or qualities of the system by which its operation can be evaluated. They are clarified by the following points:

1. RELIABILITY: The insights that we aim to obtain should be highly reliable, with minimum faults or miscalculations. Every parameter of the dataset is recorded and observed properly, and the insights that we arrive at are cross-checked against practical or previous observations.
2. SCALABILITY: Since new records are added to our dataset on a daily basis, our model should be scalable enough to adapt to the dynamic nature of the dataset.
3. SECURITY: Our project depends mainly on the covid19 database from an open-source data repository, so there is a high chance of data loss due to hackers or attackers. Our system should therefore be secured using anti-malware software, regular backups and so on.
4. MAINTAINABILITY: The system requires good maintainability from our side due to the dynamic nature of the dataset. There might be days when the number of daily cases surges abruptly, and we need to be ready for such data too.

USER REQUIREMENTS

1. The data analysis system shall take the given parameters as input and accurately compare them with the previously stored data.
2. Upon comparing the new input parameters, the probability of having covid or not is displayed as a percentage.
3. A front-end interface for taking the symptom parameters from the patients is present.
4. The user's parameters are compared against the test cases on which the model has been trained.

2.1. DATA SET

    import pandas as pd
    import matplotlib.pyplot as plt
    import plotly.express as px
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go

    dataset_url = '/content/country_wise_latest.csv'

    # Read the dataset into a pandas DataFrame
    df = pd.read_csv(dataset_url)
    print(df.head(5))

Output:

      Country/Region  Confirmed  Deaths  Recovered  Active  New cases
    0    Afghanistan      36263    1269      25198    9796        106
    1        Albania       4880     144       2745    1991        117
    2        Algeria      27973    1163      18837    7973        616
    3        Andorra        907      52        803      52         10
    4         Angola        950      41        242     667         18

       New recovered  Deaths / 100 Cases  Recovered / 100 Cases
    0             18                3.50                  69.49
    1             63                2.95                  56.25
    2            743                4.16                  67.34
    3              0                5.73                  88.53
    4              0                4.32                  25.47

       Deaths / 100 Recovered  Confirmed last week  1 week change
    0                    5.04                35526            737
    1                    5.25                 4171            709
    2                    6.17                23691           4282
    3                    6.48                  884             23
    4                   16.94                  749            201

       1 week % increase             WHO Region
    0               2.07  Eastern Mediterranean
    1              17.00                 Europe
    2              18.07                 Africa
    3               2.60                 Europe
    4              26.84                 Africa

    New deaths: [values not legible in the source]

2.2 RESULT

    # Assuming df is your DataFrame with COVID-19 data
    countries = ['India', 'China', 'Japan', 'Germany', 'United Kingdom',
                 'Afghanistan', 'Russia', 'Canada', 'Saudi Arabia', 'Israel']
    df = df[df['Country/Region'].isin(countries)]

    # Create an interactive bar graph
    fig = px.bar(df, x='Country/Region', y='Confirmed',
                 title='Confirmed COVID-19 Cases')

    # Display the graph
    fig.show()

    # Sort by 'Deaths' and 'Recovered' and select the top 10
    df = df.nlargest(10, ['Deaths', 'Recovered'])

    # Create traces
    trace1 = go.Bar(
        x=df['Country/Region'],
        y=df['Recovered'],
        name='Recovered',
        marker=dict(color='rgb(111, 53, 76)')
    )
    trace2 = go.Bar(
        x=df['Country/Region'],
        y=df['Deaths'],
        name='Deaths',
        marker=dict(color='rgb(96, 188, 25)')
    )

    # Create a layout
    layout = go.Layout(
        title='COVID-19 Recovered vs Deaths for Top 10 Countries'
    )

    # Create a figure and add traces to it
    fig = go.Figure(data=[trace1, trace2], layout=layout)

    # Display the graph
    fig.show()

    # Sort by 'Active' and select the top 10
    df = df.nlargest(10, 'Active')

    # Create a pie chart
    fig = px.pie(df, values='Active', names='Country/Region',
                 title='Active COVID-19 Cases for Top 10 Countries')

    # Display the graph
    fig.show()

[Figure: Active COVID-19 Cases for Top 10 Countries (pie chart)]

    # Select the top 10 countries for the scatter plot
    # (the sort column is illegible in the source; 'New cases' is assumed)
    top_10_countries_scatter = df.nlargest(10, 'New cases')

    # Create a scatter plot for the top 10 countries
    fig = px.scatter(top_10_countries_scatter, x='New cases', y='New deaths',
                     hover_data=['Country/Region'],
                     title='New Cases vs New Deaths (10 Countries)')

    # Display the graph
    fig.show()

[Figure: New Cases vs New Deaths (10 Countries) (scatter plot)]

[Figure: 1 Week % Increase for Top 10 Countries]

    # Sort by 'Deaths' and select the top 10
    top_countries = df.nlargest(10, 'Deaths')

    # Create a bar for each country
    bars = []
    for country in top_countries['Country/Region']:
        country_df = df[df['Country/Region'] == country]
        bars.append(go.Bar(name=country,
                           x=country_df['Country/Region'],
                           y=country_df['Deaths']))

    # Create a layout
    layout = go.Layout(
        title='Deaths in Each Country',
        barmode='group'
    )

    # Create a figure and add bars to it
    fig = go.Figure(data=bars, layout=layout)

    # Display the graph
    fig.show()

[Figure: Deaths in Each Country (bar chart)]

CHAPTER -3 DASHBOARD CODE

    # Sort by 'Deaths' and select the top 10
    top_10_countries = df.nlargest(10, 'Deaths')

    # Create traces for the first plot
    trace1 = go.Bar(
        x=top_10_countries['Country/Region'],
        y=top_10_countries['Recovered'],
        name='Recovered',
        marker=dict(color='rgb(111, 53, 76)')
    )
    trace2 = go.Bar(
        x=top_10_countries['Country/Region'],
        y=top_10_countries['Deaths'],
        name='Deaths',
        marker=dict(color='rgb(96, 188, 25)')
    )

    # Create layout for the first plot
    layout1 = go.Layout(
        title='COVID-19 Recovered vs Deaths for Top 10 Countries',
        barmode='group'
    )

    # Create the first subplot
    fig1 = go.Figure(data=[trace1, trace2], layout=layout1)

    # Create a scatter plot using the correct DataFrame (df) for the second plot
    fig2 = px.scatter(df, x='New cases', y='New deaths',
                      hover_data=['Country/Region'],
                      title='New Cases vs New Deaths')

    # Sort by '1 week % increase' and select the top 10 for the third plot
    top_10_countries_line = df.nlargest(10, '1 week % increase')
    fig3 = px.line(top_10_countries_line, x='Country/Region',
                   y='1 week % increase',
                   title='1 Week % Increase for Top 10 Countries')

    # The remainder of the dashboard listing is illegible in the source.

DASHBOARD

[Figure: COVID-19 Dashboard with four panels: Recovered vs Deaths; 1 Week % Increase for Top 10 Countries; New Cases vs New Deaths; Deaths in Each Country]

CHAPTER -4 CONCLUSION

The Covid-19 pandemic is a huge struggle for all of us. The project we are making seeks answers to the most pertinent questions: what makes Covid-19 such a tragedy, and which people are most affected by it.
It will seek to find the appropriate response that can be mounted by the authorities concerned, so that we can reach a proper discussion of the problem and solve it in the best possible manner. It may also lead to solutions for other medical conditions we might encounter later in our lives, where we can apply data science to medical diagnostics. This project saves on the already limited resources that India has and helps prevent the spread, as people can use it to get an idea of whether they should go and get tested. It also helps unhealthy and infected people to isolate themselves. Using this system we can effectively and efficiently reduce the burden on our healthcare system, which is completely stressed out.

CHAPTER -5 REFERENCES

[8] https://www.sciencedirect.com/science/article/pii/B9780128245361000058
[9] Ayoubzadeh S, Ayoubzadeh S, Zahedi H, Ahmadi M, R Niakan Kalhori S. Predicting Incidence via Analysis of Google Trends Data. Data mining and Deep Learning Pilot study in Iran. JMIR PS
[11] https://www.kaggle.com/code/syedmohammadafraim2/a-comprehensive-data-analysis-of-covid-19

Appendix

The appendix should contain computer programming (if any), the sample, calculations, explanation of theory (if any), etc., which will be used as reference.

ASSESSMENT

Internal:

    SL NO   RUBRICS                                                           FULL MARK   MARKS OBTAINED   REMARKS
    1       Understanding the relevance, scope and dimension of the project   10
    2       Methodology                                                       10
    3       Quality of Analysis and                                           10
    4       Results Interpretations                                           10
    5       [illegible]                                                       10
            Total                                                             50

Date:                                   Signature of the Faculty

COURSE OUTCOME (COs) ATTAINMENT

1. Expected Course Outcomes (COs): (Refer to COs Statement in the Syllabus)

2. Course Outcome Attained: How would you rate your learning of the subject based on the specified COs?

    0   1   2   3   4   5   6   7   8   9   10
    LOW                                     HIGH

> Learning Gap (if any):

> Books / Manuals Referred:

Date:                                   Signature of the Student

> Suggestions / Recommendations: (By the Course Faculty)

Date:                                   Signature of the Faculty
