11/11/23, 12:28 PM ML-Diabetes_dataset - Jupyter Notebook

We will use the Pima Indians Diabetes dataset to predict whether a person has diabetes, based on features such as blood pressure, skin thickness, and age.

Importing the Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [2]:
df = pd.read_csv("pima-indians-diabetes.csv", names=['Pregnancies', 'Glucose', ...
df.head()

Out[2]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction
0            6      148             72             35        0  33.6                     0.627
1            1       85             66             29        0  26.6                     0.351
2            8      183             64              0        0  23.3                     0.672
3            1       89             66             23       94  28.1                     0.167
4            0      137             40             35      168  43.1                     2.288

Checking the null values

In [3]:
df.isnull().sum()

Out[3]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

We got zero null values.

Statistical Analysis

In [4]:
df.describe()

Out[4]:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000

Checking the distribution of the dataset

In [5]:
plt.figure(figsize=(20, 15), facecolor='white')
plot_num = 1
for column in df:
    if plot_num <= 9:  # number of columns is 9
        ax = plt.subplot(3, 3, plot_num)
        sns.distplot(df[column])
        plt.xlabel(column, fontsize=20)
    plot_num += 1
plt.show()

[Figure: 3x3 grid of distribution plots, one panel per column: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]

We can see there is some skewness in the data. We can also see that a few records in the Glucose, Insulin, SkinThickness, BMI, and BloodPressure columns have a value of 0.

In [6]:
df.columns

Out[6]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [7]:
# replacing zero values with the mean of the column
df['BMI'] = df['BMI'].replace(0, df['BMI'].mean())
df['BloodPressure'] = df['BloodPressure'].replace(0, df['BloodPressure'].mean())
df['Glucose'] = df['Glucose'].replace(0, df['Glucose'].mean())
df['Insulin'] = df['Insulin'].replace(0, df['Insulin'].mean())
df['SkinThickness'] = df['SkinThickness'].replace(0, df['SkinThickness'].mean())

Distribution of data after replacing zero values with mean

In [8]:
plt.figure(figsize=(20, 15), facecolor='white')
plot_num = 1
for column in df:
    if plot_num <= 9:  # number of columns is 9
        ax = plt.subplot(3, 3, plot_num)
        sns.distplot(df[column])
        plt.xlabel(column, fontsize=20)
    plot_num += 1
plt.show()

[Figure: the same 3x3 grid of distribution plots after the zero values were replaced]

So we have dealt with the zero values.
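One detail of the replacement step is worth a quick check: `df['BMI'].mean()` is computed while the zeros are still in the column, so the fill value is pulled slightly low. The following is a minimal sketch of an alternative, not the notebook's own method, assuming a zero in these columns means "missing". The toy frame and the helper name `impute_zeros_with_mean` are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def impute_zeros_with_mean(frame, columns):
    """Return a copy where zeros in `columns` become the mean of the non-zero values.

    Hypothetical helper: assumes a zero in these columns means "missing".
    """
    out = frame.copy()
    for col in columns:
        non_zero = out[col].replace(0, np.nan)        # mark zeros as missing
        out[col] = non_zero.fillna(non_zero.mean())   # mean() skips NaN by default
    return out


# Toy data (not taken from the Pima dataset)
toy = pd.DataFrame({"Glucose": [100, 0, 140], "BMI": [30.0, 20.0, 0.0]})
filled = impute_zeros_with_mean(toy, ["Glucose", "BMI"])
print(filled)
```

The helper works on a copy, so the original frame stays available for comparing the two fill strategies.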
But there are some outliers in the dataset.

In [9]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.boxplot(data=df, width=0.5, ax=ax, fliersize=3)

Out[9]:
[Figure: box plots of Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]

Removing some amount of outliers from the dataset

In [10]:
q = df['Pregnancies'].quantile(0.98)
# we are removing the top 2% data from the Pregnancies column
df_cleaned = df[df['Pregnancies']

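The outlier-removal cell is cut off in this copy, so the exact comparison it uses is not visible. As a rough sketch of quantile-based trimming in general, assuming rows strictly below the chosen quantile are kept (the helper name `drop_above_quantile` and the toy frame are hypothetical, not from the notebook):

```python
import pandas as pd


def drop_above_quantile(frame, column, q):
    """Keep only rows whose `column` value is strictly below the q-th quantile.

    Hypothetical helper; the strict '<' comparison is an assumption.
    """
    threshold = frame[column].quantile(q)
    return frame[frame[column] < threshold]


# Toy data: the integers 0..100 (not taken from the Pima dataset)
toy = pd.DataFrame({"Pregnancies": list(range(101))})
trimmed = drop_above_quantile(toy, "Pregnancies", 0.98)
print(len(toy), len(trimmed))
```

Because the filter returns a new frame, the untrimmed data is still there if a different cutoff needs to be tried.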