Download as pdf or txt
Download as pdf or txt
You are on page 1of 34

NO APPOINTMENT DATA ANALYSIS

INTRODUCTION
This dataset contains over 100k information of medical appointment of patients in Brazil
and the focus is to investigate the factors that affects patients attendance to their
appointment. We are to explore and determine what could make a patient not attend
his/her appointment. The dataset consist of 14 columns and below are the columns and
their descriptions.

Data Dictionary
Column Description

PatientId Identification of each patient


AppointmentID Identification of each appointment
Gender Male or Female
ScheduleDay What day did the patient set up for their
appointment
AppointmentDay The day of the actual appointment, when they
have to visit the doctor
Age How old is the patient
Neighbourhood The location of the hopital
Scholarship Medical scholarship (True or False: 0 =
False, 1 = True )
Hpertension True or False (0 = False, 1 = True)
Diabetes True or False (0 = False, 1 = True)
Alcoholism True or False (0 = False, 1 = True)
Handcap True or False (on a scal of 0 to 4)
SMS_received Messages sent to patients (0 = False, 1 =
True)
No-show Patients who attended or misseed their
appointment (Yes=Missed, NO=Attended)

#### Questions
1) Does gender affect if a person will show up for there scheduled
appointment?
2) Does age affect if a person will show up for there scheduled
appointment?
3) Does having a scholarship affect a patient showing up for his/her
appointment?
4) Does having a certain ailment affect if a patient will show up or
not?
5) How often do men show up as against women?

# importing of relevant libraries that will be used for this project


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#from IPython.display import set_matplotlib_formats
#set_matplotlib_formats('retina')

%matplotlib inline

# importing and reading of the file using pandas read_csv

data = pd.read_csv(r"C:\Users\Daisy Dickson\Desktop\Current Project\


noshowappointments-kagglev2-may-2016.csv")
data.head(5)

PatientId AppointmentID Gender ScheduledDay \


0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z

AppointmentDay Age Neighbourhood Scholarship


Hipertension \
0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0
1
1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0
0
2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0
0
3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0
0
4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0
1

Diabetes Alcoholism Handcap SMS_received No-show


0 0 0 0 0 No
1 0 0 0 0 No
2 0 0 0 0 No
3 0 0 0 0 No
4 1 0 0 0 No

## Assessing the data


# Using the 'shape' funtion helps us know the number of rows and
columns.
# It shows that this dataset has 110,527 rows and 14 colums

data.shape

(110527, 14)

# The 'info' function shows the datatypes, if there are null values
and its count.
# This helps us know if there are missing values.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PatientId 110527 non-null float64
1 AppointmentID 110527 non-null int64
2 Gender 110527 non-null object
3 ScheduledDay 110527 non-null object
4 AppointmentDay 110527 non-null object
5 Age 110527 non-null int64
6 Neighbourhood 110527 non-null object
7 Scholarship 110527 non-null int64
8 Hipertension 110527 non-null int64
9 Diabetes 110527 non-null int64
10 Alcoholism 110527 non-null int64
11 Handcap 110527 non-null int64
12 SMS_received 110527 non-null int64
13 No-show 110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

# This is another way to check for missing values. From the output, it
shows there are no missing values for each column

data.isnull().sum()
PatientId 0
AppointmentID 0
Gender 0
ScheduledDay 0
AppointmentDay 0
Age 0
Neighbourhood 0
Scholarship 0
Hipertension 0
Diabetes 0
Alcoholism 0
Handcap 0
SMS_received 0
No-show 0
dtype: int64

# 'describe' function gives a statistical overview of the data. It


shows the count, mean, standard deviation(std),
# min, the quartile of 25%, 50% aand 75%

data.describe()

PatientId AppointmentID Age Scholarship \


count 1.105270e+05 1.105270e+05 110527.000000 110527.000000
mean 1.474963e+14 5.675305e+06 37.088874 0.098266
std 2.560949e+14 7.129575e+04 23.110205 0.297675
min 3.921784e+04 5.030230e+06 -1.000000 0.000000
25% 4.172614e+12 5.640286e+06 18.000000 0.000000
50% 3.173184e+13 5.680573e+06 37.000000 0.000000
75% 9.439172e+13 5.725524e+06 55.000000 0.000000
max 9.999816e+14 5.790484e+06 115.000000 1.000000

Hipertension Diabetes Alcoholism Handcap \


count 110527.000000 110527.000000 110527.000000 110527.000000
mean 0.197246 0.071865 0.030400 0.022248
std 0.397921 0.258265 0.171686 0.161543
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 1.000000 4.000000

SMS_received
count 110527.000000
mean 0.321026
std 0.466873
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000

# Checking the uniqueness of each column

data.nunique()

PatientId 62299
AppointmentID 110527
Gender 2
ScheduledDay 103549
AppointmentDay 27
Age 104
Neighbourhood 81
Scholarship 2
Hipertension 2
Diabetes 2
Alcoholism 2
Handcap 5
SMS_received 2
No-show 2
dtype: int64

# let's assess the exact unique values of some of the columns

print('Gender', data.Gender.unique())
print('Scholarship', data.Scholarship.unique())
print('Handcap', data.Handcap.unique())
print('No-show', data['No-show'].unique())

Gender ['F' 'M']


Scholarship [0 1]
Handcap [0 1 2 3 4]
No-show ['No' 'Yes']

Data Wrangling
# Renaming of some columns because of typos

data.rename(columns = {'Hipertension': 'Hypertension', 'Handcap':


'Handicap', 'No-show': 'No_show'}, inplace = True)
data.head()

PatientId AppointmentID Gender ScheduledDay \


0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z

AppointmentDay Age Neighbourhood Scholarship


Hypertension \
0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0
1
1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0
0
2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0
0
3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0
0
4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0
1

Diabetes Alcoholism Handicap SMS_received No_show


0 0 0 0 0 No
1 0 0 0 0 No
2 0 0 0 0 No
3 0 0 0 0 No
4 1 0 0 0 No

# change all columns to lowercase for easy coding

data.columns = data.columns.str.lower()
data.columns

Index(['patientid', 'appointmentid', 'gender', 'scheduledday',


'appointmentday', 'age', 'neighbourhood', 'scholarship',
'hypertension',
'diabetes', 'alcoholism', 'handicap', 'sms_received',
'no_show'],
dtype='object')
# change the value of No_show to '0 = if patient showed up' and '1 =
if patient did not show up'

data['no_show'] = data['no_show'].apply(lambda x: 0 if x == 'No' else


1)
data.head(10)

patientid appointmentid gender scheduledday \


0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z
5 9.598513e+13 5626772 F 2016-04-27T08:36:51Z
6 7.336882e+14 5630279 F 2016-04-27T15:05:12Z
7 3.449833e+12 5630575 F 2016-04-27T15:39:58Z
8 5.639473e+13 5638447 F 2016-04-29T08:02:16Z
9 7.812456e+13 5629123 F 2016-04-27T12:48:25Z

appointmentday age neighbourhood scholarship


hypertension \
0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0
1
1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0
0
2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0
0
3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0
0
4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0
1
5 2016-04-29T00:00:00Z 76 REPÚBLICA 0
1
6 2016-04-29T00:00:00Z 23 GOIABEIRAS 0
0
7 2016-04-29T00:00:00Z 39 GOIABEIRAS 0
0
8 2016-04-29T00:00:00Z 21 ANDORINHAS 0
0
9 2016-04-29T00:00:00Z 19 CONQUISTA 0
0

diabetes alcoholism handicap sms_received no_show


0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 1
7 0 0 0 0 1
8 0 0 0 0 0
9 0 0 0 0 0

# Change datetime to just date

data[['scheduledday','appointmentday']]=data[['scheduledday','appointm
entday']].apply(pd.to_datetime);
data['scheduledday'] = data['scheduledday'].dt.strftime('%d-%m-%Y');
data['appointmentday'] = data['appointmentday'].dt.strftime('%d-%m-
%Y');
data.head(10)

patientid appointmentid gender scheduledday appointmentday age


\
0 2.987250e+13 5642903 F 29-04-2016 29-04-2016 62

1 5.589978e+14 5642503 M 29-04-2016 29-04-2016 56

2 4.262962e+12 5642549 F 29-04-2016 29-04-2016 62

3 8.679512e+11 5642828 F 29-04-2016 29-04-2016 8

4 8.841186e+12 5642494 F 29-04-2016 29-04-2016 56

5 9.598513e+13 5626772 F 27-04-2016 29-04-2016 76

6 7.336882e+14 5630279 F 27-04-2016 29-04-2016 23

7 3.449833e+12 5630575 F 27-04-2016 29-04-2016 39

8 5.639473e+13 5638447 F 29-04-2016 29-04-2016 21

9 7.812456e+13 5629123 F 27-04-2016 29-04-2016 19

neighbourhood scholarship hypertension diabetes alcoholism


\
0 JARDIM DA PENHA 0 1 0 0

1 JARDIM DA PENHA 0 0 0 0
2 MATA DA PRAIA 0 0 0 0

3 PONTAL DE CAMBURI 0 0 0 0

4 JARDIM DA PENHA 0 1 1 0

5 REPÚBLICA 0 1 0 0

6 GOIABEIRAS 0 0 0 0

7 GOIABEIRAS 0 0 0 0

8 ANDORINHAS 0 0 0 0

9 CONQUISTA 0 0 0 0

handicap sms_received no_show


0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 1
7 0 0 1
8 0 0 0
9 0 0 0

# create a waiting column

data['scheduledday']=pd.to_datetime(data['scheduledday'])
data['scheduleddate']=data['scheduledday'].dt.date

data['appointmentday']=pd.to_datetime(data['appointmentday'])
data['appointmentdate']=data['appointmentday'].dt.date

data['waiting_days']=(data['appointmentdate']-
data['scheduleddate']).dt.days

data

#pd.options.mode.chained_assignment = None
gender scheduledday appointmentday age neighbourhood \
0 F 2016-04-29 2016-04-29 62 JARDIM DA PENHA
1 M 2016-04-29 2016-04-29 56 JARDIM DA PENHA
2 F 2016-04-29 2016-04-29 62 MATA DA PRAIA
3 F 2016-04-29 2016-04-29 8 PONTAL DE CAMBURI
4 F 2016-04-29 2016-04-29 56 JARDIM DA PENHA
... ... ... ... ... ...
110522 F 2016-03-05 2016-07-06 56 MARIA ORTIZ
110523 F 2016-03-05 2016-07-06 51 MARIA ORTIZ
110524 F 2016-04-27 2016-07-06 21 MARIA ORTIZ
110525 F 2016-04-27 2016-07-06 38 MARIA ORTIZ
110526 F 2016-04-27 2016-07-06 54 MARIA ORTIZ

scholarship hypertension diabetes alcoholism handicap \


0 0 1 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 1 1 0 0
... ... ... ... ... ...
110522 0 0 0 0 0
110523 0 0 0 0 0
110524 0 0 0 0 0
110525 0 0 0 0 0
110526 0 0 0 0 0

sms_received no_show scheduleddate appointmentdate


waiting_days \
0 0 0 2016-04-29 2016-04-29
0
1 0 0 2016-04-29 2016-04-29
0
2 0 0 2016-04-29 2016-04-29
0
3 0 0 2016-04-29 2016-04-29
0
4 0 0 2016-04-29 2016-04-29
0
... ... ... ... ...
...
110522 1 0 2016-03-05 2016-07-06
123
110523 1 0 2016-03-05 2016-07-06
123
110524 1 0 2016-04-27 2016-07-06
70
110525 1 0 2016-04-27 2016-07-06
70
110526 1 0 2016-04-27 2016-07-06
70
agerange
0 59-79
1 40-59
2 59-79
3 0-19
4 40-59
... ...
110522 40-59
110523 40-59
110524 20-39
110525 20-39
110526 40-59

[110526 rows x 16 columns]

data.query('waiting_days<0')
data['waiting_days'][data.waiting_days<0]=0

# check for duplicates so it can be dropped

data.duplicated().sum() ## this shows there are no duplicates

# drop columns not necessary for our analysis

data.drop(['patientid','appointmentid'],axis = 1, inplace = True)

# check if there are rows with age less than o

data.query('age<0') # one row was found with age = -1

gender scheduledday appointmentday age neighbourhood


scholarship \
99832 F 2016-06-06 2016-06-06 -1 ROMÃO
0

hypertension diabetes alcoholism handicap sms_received


no_show \
99832 0 0 0 0 0
0

scheduleddate appointmentdate waiting_days


99832 2016-06-06 2016-06-06 0

# drop row with age less than 0

#wrong_age = data[data['age']== -1].index # alternative way to drop


the row
#data.drop(wrong_age, inplace = True)

data.drop(data.query('age<0').index, inplace=True)

# confirm row with age less than 0 has been dropped

data.query('age<0') # age less than zero row has been removed

Empty DataFrame
Columns: [gender, scheduledday, appointmentday, age, neighbourhood,
scholarship, hypertension, diabetes, alcoholism, handicap,
sms_received, no_show, scheduleddate, appointmentdate, waiting_days]
Index: []

Exploratory Data Analysis

# generally assessing all numerical columns

data.hist(figsize=(8,8), color='black');
# assessing the correlation of the dataset columns

data_corr = data.corr()
data_corr.style.background_gradient(cmap='coolwarm',axis=None)

<pandas.io.formats.style.Styler at 0x2dfa1d2a130>

# using heatmap to check correlation

f, ax=plt.subplots(figsize=(10,9))
data_corr = data.corr()
sns.heatmap(data_corr,annot=True)
<AxesSubplot:>

# total number of patients that attended or did not attend their


appointment in percentage

patients = data.no_show.value_counts()
print(patients)

plt.pie(patients, labels = ['Attended', 'Did Not Attend'], autopct =


'%1.1f%%', explode = (0, 0.07), colors =['red','black']);
plt.title('Percentage of patients')
plt.legend();
0 88207
1 22319
Name: no_show, dtype: int64

1) Does gender affect if a person will show up for there scheduled appointment?

# the mean of genders that show up or not against health conditions

pv=pd.pivot_table(data,index=['gender','no_show'])
pv

age alcoholism diabetes handicap


hypertension \
gender no_show

F 0 39.591126 0.015984 0.080164 0.019792


0.221539
1 36.162190 0.021105 0.069686 0.018569
0.182061
M 0 34.461372 0.057102 0.062141 0.028196
0.172696
1 30.833010 0.047767 0.053463 0.023560
0.144337

scholarship sms_received waiting_days


gender no_show
F 0 0.117862 0.305389 26.900009
1 0.144306 0.460463 43.356516
M 0 0.049609 0.265358 24.436018
1 0.061100 0.396634 40.897735

# total number of patients by their gender

data.gender.value_counts().plot(kind = 'bar',color = ['red', 'black'])


plt.ylabel('Number of patients')
plt.xlabel('Gender')
plt.title('Number of Patients by Gender')
plt.show();

#category of genders that showed up or not

gender_count = data.gender.value_counts()
print(gender_count)

gender_no_show_count = data.groupby('gender').no_show.value_counts()
print(gender_no_show_count)

data.groupby('gender').no_show.value_counts().plot(kind='barh',
color=['red']);
plt.title('Gender Appointment Status')
plt.xlabel('Gender')
plt.show()

F 71839
M 38687
Name: gender, dtype: int64
gender no_show
F 0 57245
1 14594
M 0 30962
1 7725
Name: no_show, dtype: int64

2) Does age affect if a person will show up for there scheduled appointment?
#chech for outlier for age

sns.boxplot(data=data, x='age', color='black');


# using age range instead of actual age
#code for converting age to age range for this dataset

def age_range(x):
if x<20:
return '0-19'
elif x<40:
return '20-39'
elif x<60:
return '40-59'
elif x<80:
return '59-79'
elif x>=80:
return '80+'
else:
return 'other'
data['agerange']= data.age.apply(age_range)
data

gender scheduledday appointmentday age neighbourhood \


0 F 2016-04-29 2016-04-29 62 JARDIM DA PENHA
1 M 2016-04-29 2016-04-29 56 JARDIM DA PENHA
2 F 2016-04-29 2016-04-29 62 MATA DA PRAIA
3 F 2016-04-29 2016-04-29 8 PONTAL DE CAMBURI
4 F 2016-04-29 2016-04-29 56 JARDIM DA PENHA
... ... ... ... ... ...
110522 F 2016-03-05 2016-07-06 56 MARIA ORTIZ
110523 F 2016-03-05 2016-07-06 51 MARIA ORTIZ
110524 F 2016-04-27 2016-07-06 21 MARIA ORTIZ
110525 F 2016-04-27 2016-07-06 38 MARIA ORTIZ
110526 F 2016-04-27 2016-07-06 54 MARIA ORTIZ

scholarship hypertension diabetes alcoholism handicap \


0 0 1 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 1 1 0 0
... ... ... ... ... ...
110522 0 0 0 0 0
110523 0 0 0 0 0
110524 0 0 0 0 0
110525 0 0 0 0 0
110526 0 0 0 0 0

sms_received no_show scheduleddate appointmentdate


waiting_days \
0 0 0 2016-04-29 2016-04-29
0
1 0 0 2016-04-29 2016-04-29
0
2 0 0 2016-04-29 2016-04-29
0
3 0 0 2016-04-29 2016-04-29
0
4 0 0 2016-04-29 2016-04-29
0
... ... ... ... ...
...
110522 1 0 2016-03-05 2016-07-06
123
110523 1 0 2016-03-05 2016-07-06
123
110524 1 0 2016-04-27 2016-07-06
70
110525 1 0 2016-04-27 2016-07-06
70
110526 1 0 2016-04-27 2016-07-06
70

agerange
0 59-79
1 40-59
2 59-79
3 0-19
4 40-59
... ...
110522 40-59
110523 40-59
110524 20-39
110525 20-39
110526 40-59

[110526 rows x 16 columns]

# Number of ages
# it shows that majority of patients are between 0-60 years old

data.agerange.value_counts().plot(kind = 'bar',color = ['red',


'black'])
plt.ylabel('Number of patients')
plt.xlabel('agerange')
plt.title('Number of Patients by Age')
plt.show();

# age againdt appointment status


# it shows that patients from 0-60 years old showed up for their
appointments

age_count = data.agerange.value_counts();
print(age_count);

age_no_show_count = data.groupby('agerange').no_show.value_counts()
print(age_no_show_count);

data.groupby('agerange').no_show.value_counts().plot(kind='bar',
color=['red']);
plt.title('Age Appointment Status')
plt.xlabel('Age')
plt.show()

0-19 30411
40-59 30072
20-39 28870
59-79 17810
80+ 3363
Name: agerange, dtype: int64
agerange no_show
0-19 0 23670
1 6741
20-39 0 22190
1 6680
40-59 0 24416
1 5656
59-79 0 15118
1 2692
80+ 0 2813
1 550
Name: no_show, dtype: int64
3) Does having a scholarship affect a patient showing up for his/her appointment?

# Number of patients with scholarship that attended or not


sch= pd.pivot_table(data=data,
index=['scholarship'],columns='no_show',values=['age'],aggfunc=['count
'])
print(sch)

sch.plot(kind='bar',color=['red','black'])
plt.legend(title='Status',labels=['Attended','Did not Attend']);

count
age
no_show 0 1
scholarship
0 79924 19741
1 8283 2578
4) Does having a certain ailment affect if a patient will show up or not?

# using subplots to assess patients with or without health challenges


that showed up for their appointment or did not.

plt.subplot(2,2,1)
hypertension=data['hypertension'].map({0:'No',1:'Yes'})
sns.countplot(hypertension,
data=data,hue='no_show',palette=['red','black'])
plt.title('Hypertension status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])

plt.subplot(2,2,2)
handicap=data['handicap']
sns.countplot(handicap,
data=data,hue='no_show',palette=['red','black'])
plt.title('Handicap status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])

plt.subplot(2,2,3)
diabetes=data['diabetes'].map({0:'No',1:'Yes'})
sns.countplot(diabetes,
data=data,hue='no_show',palette=['red','black'])
plt.title('Diabetes status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])
plt.subplot(2,2,4)
alcoholism=data['alcoholism'].map({0:'No',1:'Yes'})
sns.countplot(diabetes,
data=data,hue='no_show',palette=['red','black'])
plt.title('Alcoholism status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])

plt.subplots_adjust(left=0,right=1.5,bottom=0,top=1.5,wspace=0.3,hspac
e=0.3)

pd.options.mode.chained_assignment = None

C:\Users\Daisy Dickson\anaconda3\lib\site-packages\seaborn\
_decorators.py:36: FutureWarning: Pass the following variable as a
keyword arg: x. From version 0.12, the only valid positional argument
will be `data`, and passing other arguments without an explicit
keyword will result in an error or misinterpretation.
warnings.warn(
C:\Users\Daisy Dickson\anaconda3\lib\site-packages\seaborn\
_decorators.py:36: FutureWarning: Pass the following variable as a
keyword arg: x. From version 0.12, the only valid positional argument
will be `data`, and passing other arguments without an explicit
keyword will result in an error or misinterpretation.
warnings.warn(
C:\Users\Daisy Dickson\anaconda3\lib\site-packages\seaborn\
_decorators.py:36: FutureWarning: Pass the following variable as a
keyword arg: x. From version 0.12, the only valid positional argument
will be `data`, and passing other arguments without an explicit
keyword will result in an error or misinterpretation.
warnings.warn(
C:\Users\Daisy Dickson\anaconda3\lib\site-packages\seaborn\
_decorators.py:36: FutureWarning: Pass the following variable as a
keyword arg: x. From version 0.12, the only valid positional argument
will be `data`, and passing other arguments without an explicit
keyword will result in an error or misinterpretation.
warnings.warn(
5) How often do women show up as against men?

# gender against appointment status

data.groupby('no_show').gender.value_counts().plot(kind='area',
color=['red']);
plt.title('Gender That Keep To Appointment')
plt.xlabel('Alcoholism/Gender')
plt.show()
Other Findings

# gender againts health challenges

data.groupby('hypertension').gender.value_counts().plot(kind='barh',
color=['red']);
plt.title('Gender vs Hypertension')
plt.xlabel('Hypertension/Gender')
plt.show()

data.groupby('diabetes').gender.value_counts().plot(kind='barh',
color=['black']);
plt.title('Gender vs Diabetes')
plt.xlabel('Diabetes/Gender')
plt.show()

data.groupby('alcoholism').gender.value_counts().plot(kind='barh',
color=['red']);
plt.title('Gender vs Alcoholism')
plt.xlabel('Alcoholism/Gender')
plt.show()

data.groupby('handicap').gender.value_counts().plot(kind='barh',
color=['black']);
plt.title('Gender vs Handicap')
plt.xlabel('Handicap/Gender')
plt.show()
#data.groupby('gender')['hypertension'].mean().plot(kind='bar')
# this plot shows that majority received sms
sms = data.sms_received.value_counts()
print(sms)

plt.pie(patients, labels = ['received', 'not received'], autopct =


'%1.1f%%', explode = (0, 0.07), colors =['red','black']);
plt.title('Percentage of sms')
plt.legend();

0 75044
1 35482
Name: sms_received, dtype: int64

# majority of patients that received sms attended their apointment

sms_received= pd.pivot_table(data=data,
index=['sms_received'],columns='no_show',values=['age'],aggfunc=['coun
t'])
print(sms_received)

sms_received.plot(kind='bar',color=['red','black'])
plt.legend(title='Status',labels=['Attended','did not attend']);

count
age
no_show 0 1
sms_received
0 62509 12535
1 25698 9784
# gender with highest health challenges

# hypertension

hyp= pd.pivot_table(data=data,
index=['hypertension'],columns='gender',values=['age'],aggfunc=['count
'])
print(hyp)

hyp.plot(kind='bar', stacked=True, color=['red','black']);


plt.legend(title='Status',labels=['Female','Male']);

# diabetes

dia= pd.pivot_table(data=data,
index=['diabetes'],columns='gender',values=['age'],aggfunc=['count'])
print(dia)

dia.plot(kind='bar',stacked=True, color=['red','black']);
plt.legend(title='Status',labels=['Female','Male']);

# handicap
han= pd.pivot_table(data=data,
index=['handicap'],columns='gender',values=['age'],aggfunc=['count'])
print(han)

han.plot(kind='bar',stacked=True, color=['red','black']);
plt.legend(title='Status',labels=['Female','Male']);

# alcoholism
ach= pd.pivot_table(data=data,
index=['alcoholism'],columns='gender',values=['age'],aggfunc=['count']
)
print(ach)

ach.plot(kind='bar',stacked=True, color=['red','black']);
plt.legend(title='Status',labels=['Female','Male']);

count
age
gender F M
hypertension
0 56500 32225
1 15339 6462
count
age
gender F M
diabetes
0 66233 36350
1 5606 2337
count
age
gender F M
handicap
0 70549 37736
1 1181 861
2 105 78
3 3 10
4 1 2
count
age
gender F M
alcoholism
0 70616 36550
1 1223 2137
wait= data.query('waiting_days<10')
wait.groupby('waiting_days')
['no_show'].value_counts().plot(kind='bar')

#wait=data.groupby('waiting_days').value_counts()
#plt.plot(wait);

<AxesSubplot:xlabel='waiting_days,no_show'>
Conclusion
• The dataset records more of female patients to male and it shows that females
showed up more for their appointment.
• There is 50% correlation between age and hypertension. Majority of patients
between 0-60 years old showed up for their appointment
• Another unexpected outcome is majority of patients without scholarship showed up
for their appointment
• The rate of patients who did not receive sms, yet showed up is high which wasn't
expected
• More women suffer from hypertension and diabetes compared to men

Limitations
• This dataset is insufficient to draw unbiased conclusion. It contains information of
just 3 months in 2016
• The column 'No_show' values of 'no - showed up' and 'yes - did not show up' was
very confusing. Took time to understand
• There are outliers and age that is less than 0. Rows with less than 0 was removed

You might also like