Doctors Appointment
INTRODUCTION
This dataset contains over 100k records of medical appointments for patients in Brazil.
The focus is to investigate the factors that affect patients' attendance at their
appointments, that is, to explore what could make a patient miss a scheduled
appointment. The dataset consists of 14 columns; the columns and their descriptions
are listed below.
Data Dictionary
Column Description
#### Questions
1) Does gender affect whether a person will show up for their scheduled
appointment?
2) Does age affect whether a person will show up for their scheduled
appointment?
3) Does having a scholarship affect whether a patient shows up for his/her
appointment?
4) Does having a certain ailment affect whether a patient will show up or
not?
5) How often do men show up compared with women?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data.shape
(110527, 14)
# The 'info' function shows the datatype of each column and its non-null count.
# This helps us know if there are missing values.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PatientId 110527 non-null float64
1 AppointmentID 110527 non-null int64
2 Gender 110527 non-null object
3 ScheduledDay 110527 non-null object
4 AppointmentDay 110527 non-null object
5 Age 110527 non-null int64
6 Neighbourhood 110527 non-null object
7 Scholarship 110527 non-null int64
8 Hipertension 110527 non-null int64
9 Diabetes 110527 non-null int64
10 Alcoholism 110527 non-null int64
11 Handcap 110527 non-null int64
12 SMS_received 110527 non-null int64
13 No-show 110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
# This is another way to check for missing values. The output shows
# that there are no missing values in any column.
data.isnull().sum()
PatientId 0
AppointmentID 0
Gender 0
ScheduledDay 0
AppointmentDay 0
Age 0
Neighbourhood 0
Scholarship 0
Hipertension 0
Diabetes 0
Alcoholism 0
Handcap 0
SMS_received 0
No-show 0
dtype: int64
data.describe()
SMS_received
count 110527.000000
mean 0.321026
std 0.466873
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
data.nunique()
PatientId 62299
AppointmentID 110527
Gender 2
ScheduledDay 103549
AppointmentDay 27
Age 104
Neighbourhood 81
Scholarship 2
Hipertension 2
Diabetes 2
Alcoholism 2
Handcap 5
SMS_received 2
No-show 2
dtype: int64
print('Gender', data.Gender.unique())
print('Scholarship', data.Scholarship.unique())
print('Handcap', data.Handcap.unique())
print('No-show', data['No-show'].unique())
Data Wrangling
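# Lowercase the column names, then rename the ones that contain typos
data.columns = data.columns.str.lower()
data = data.rename(columns={'hipertension': 'hypertension',
                            'handcap': 'handicap',
                            'no-show': 'no_show'})
data.columns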
data[['scheduledday','appointmentday']] = data[['scheduledday','appointmentday']].apply(pd.to_datetime)
data['scheduledday'] = data['scheduledday'].dt.strftime('%d-%m-%Y')
data['appointmentday'] = data['appointmentday'].dt.strftime('%d-%m-%Y')
data.head(10)
1 JARDIM DA PENHA 0 0 0 0
2 MATA DA PRAIA 0 0 0 0
3 PONTAL DE CAMBURI 0 0 0 0
4 JARDIM DA PENHA 0 1 1 0
5 REPÚBLICA 0 1 0 0
6 GOIABEIRAS 0 0 0 0
7 GOIABEIRAS 0 0 0 0
8 ANDORINHAS 0 0 0 0
9 CONQUISTA 0 0 0 0
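The outputs further down show a no_show column holding 0 and 1, and no patientid or appointmentid columns, but the cells that produced this are not part of this export. Below is a minimal sketch of one way to get there, assuming the raw values are the strings 'No' and 'Yes'. The three-row frame is made up for illustration and is not the real data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data
toy = pd.DataFrame({
    'patientid': [1.0, 2.0, 3.0],
    'appointmentid': [10, 11, 12],
    'no-show': ['No', 'Yes', 'No'],
})

# Rename the hyphenated column so it can be used as an attribute (toy.no_show)
toy = toy.rename(columns={'no-show': 'no_show'})

# Encode the flag: 0 = attended, 1 = did not attend
toy['no_show'] = toy['no_show'].map({'No': 0, 'Yes': 1})

# Identifier columns carry no information for this kind of analysis
toy = toy.drop(columns=['patientid', 'appointmentid'])

print(toy)
```

With a 0/1 flag, the mean of no_show within any group is that group's no-show rate.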
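# Parse the day-first strings back to datetimes. The explicit format matters:
# without it, a value such as '03-05-2016' is read month-first (as 5 March).
data['scheduledday'] = pd.to_datetime(data['scheduledday'], format='%d-%m-%Y')
data['scheduleddate'] = data['scheduledday'].dt.date
data['appointmentday'] = pd.to_datetime(data['appointmentday'], format='%d-%m-%Y')
data['appointmentdate'] = data['appointmentday'].dt.date
# Both columns now hold midnight timestamps, so their difference is whole days
data['waiting_days'] = (data['appointmentday'] - data['scheduledday']).dt.days
data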
#pd.options.mode.chained_assignment = None
gender scheduledday appointmentday age neighbourhood \
0 F 2016-04-29 2016-04-29 62 JARDIM DA PENHA
1 M 2016-04-29 2016-04-29 56 JARDIM DA PENHA
2 F 2016-04-29 2016-04-29 62 MATA DA PRAIA
3 F 2016-04-29 2016-04-29 8 PONTAL DE CAMBURI
4 F 2016-04-29 2016-04-29 56 JARDIM DA PENHA
... ... ... ... ... ...
110522 F 2016-03-05 2016-07-06 56 MARIA ORTIZ
110523 F 2016-03-05 2016-07-06 51 MARIA ORTIZ
110524 F 2016-04-27 2016-07-06 21 MARIA ORTIZ
110525 F 2016-04-27 2016-07-06 38 MARIA ORTIZ
110526 F 2016-04-27 2016-07-06 54 MARIA ORTIZ
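The last rows above pair a scheduledday of 2016-03-05 with an appointmentday of 2016-07-06. That is the pattern you would see if day-first strings such as '03-05-2016' were read back month-first. Whether that happened in this notebook is an assumption, but the effect itself is easy to reproduce on two made-up strings:

```python
import pandas as pd

raw = pd.Series(['03-05-2016', '07-06-2016'])  # day-month-year strings

# Default parsing treats the first field as the month
month_first = pd.to_datetime(raw)

# An explicit format reads the first field as the day
day_first = pd.to_datetime(raw, format='%d-%m-%Y')

print(month_first.dt.month.tolist())  # [3, 7]
print(day_first.dt.month.tolist())    # [5, 6]
```

Passing format (or dayfirst=True) whenever dates have been written out as day-first strings avoids the ambiguity.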
data.query('waiting_days<0')
# Set negative waiting times to 0; .loc writes to the frame itself rather than to a copy
data.loc[data.waiting_days<0, 'waiting_days'] = 0
data.drop(data.query('age<0').index, inplace=True)
Empty DataFrame
Columns: [gender, scheduledday, appointmentday, age, neighbourhood,
scholarship, hypertension, diabetes, alcoholism, handicap,
sms_received, no_show, scheduleddate, appointmentdate, waiting_days]
Index: []
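Replacing negative waiting times with 0 can be written in more than one way. The sketch below shows two options on made-up values: a label-based assignment with .loc, and Series.clip. Chained indexing of the form data['col'][mask] = value is best avoided, since pandas may apply it to a temporary copy.

```python
import pandas as pd

toy = pd.DataFrame({'waiting_days': [3, -1, 0, -6, 12]})  # made-up values

# Option 1: label-based assignment on the frame itself
toy.loc[toy['waiting_days'] < 0, 'waiting_days'] = 0

# Option 2: clip raises every value below the bound up to the bound
clipped = pd.Series([3, -1, 0, -6, 12]).clip(lower=0)

print(toy['waiting_days'].tolist())  # [3, 0, 0, 0, 12]
print(clipped.tolist())              # [3, 0, 0, 0, 12]
```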
data.hist(figsize=(8,8), color='black');
# assessing the correlation of the numeric dataset columns
data_corr = data.corr(numeric_only=True)
data_corr.style.background_gradient(cmap='coolwarm',axis=None)
<pandas.io.formats.style.Styler at 0x2dfa1d2a130>
f, ax = plt.subplots(figsize=(10,9))
sns.heatmap(data_corr, annot=True, ax=ax)
<AxesSubplot:>
patients = data.no_show.value_counts()
print(patients)
1) Does gender affect whether a person will show up for their scheduled appointment?
pv=pd.pivot_table(data,index=['gender','no_show'])
pv
gender_count = data.gender.value_counts()
print(gender_count)
gender_no_show_count = data.groupby('gender').no_show.value_counts()
print(gender_no_show_count)
data.groupby('gender').no_show.value_counts().plot(kind='barh', color=['red']);
plt.title('Gender Appointment Status')
# In a horizontal bar chart the x-axis carries the counts
plt.xlabel('Number of appointments')
plt.show()
F 71839
M 38687
Name: gender, dtype: int64
gender no_show
F 0 57245
1 14594
M 0 30962
1 7725
Name: no_show, dtype: int64
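Raw counts are hard to compare because there are almost twice as many appointments for women as for men. A quick sketch of turning the counts above into no-show rates per gender; the numbers are copied from the output, while the variable names are my own:

```python
import pandas as pd

# Counts copied from the groupby output above (0 = attended, 1 = did not attend)
counts = pd.DataFrame(
    {0: [57245, 30962], 1: [14594, 7725]},
    index=pd.Index(['F', 'M'], name='gender'),
)
counts.columns.name = 'no_show'

# Share of each gender's appointments that were missed
no_show_rate = counts[1] / counts.sum(axis=1)
print(no_show_rate.round(3))
```

Both rates come out at roughly 0.20, so on these counts gender makes little difference to attendance.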
2) Does age affect whether a person will show up for their scheduled appointment?
# group ages into ranges
def age_range(x):
    if x<20:
        return '0-19'
    elif x<40:
        return '20-39'
    elif x<60:
        return '40-59'
    elif x<80:
        return '59-79'  # this bin covers ages 60-79
    elif x>=80:
        return '80+'
    else:
        return 'other'
data['agerange']= data.age.apply(age_range)
data
agerange
0 59-79
1 40-59
2 59-79
3 0-19
4 40-59
... ...
110522 40-59
110523 40-59
110524 20-39
110525 20-39
110526 40-59
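The same binning can also be done with pd.cut instead of a hand-written function. This is only a sketch on made-up ages; the bin edges follow the function above, and the label '60-79' is my choice for the bin that the function calls '59-79'.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 19, 20, 59, 60, 79, 80, 115])  # made-up ages

# right=False makes each bin include its left edge: [0, 20), [20, 40), ...
bins = [0, 20, 40, 60, 80, np.inf]
labels = ['0-19', '20-39', '40-59', '60-79', '80+']
agerange = pd.cut(ages, bins=bins, labels=labels, right=False)

print(agerange.tolist())
# ['0-19', '0-19', '20-39', '40-59', '60-79', '60-79', '80+', '80+']
```

pd.cut returns an ordered categorical, so group-by results and plots keep the bins in age order rather than alphabetical order.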
# Number of ages
# it shows that majority of patients are between 0-60 years old
age_count = data.agerange.value_counts();
print(age_count);
age_no_show_count = data.groupby('agerange').no_show.value_counts()
print(age_no_show_count);
data.groupby('agerange').no_show.value_counts().plot(kind='bar',
color=['red']);
plt.title('Age Appointment Status')
plt.xlabel('Age')
plt.show()
0-19 30411
40-59 30072
20-39 28870
59-79 17810
80+ 3363
Name: agerange, dtype: int64
agerange no_show
0-19 0 23670
1 6741
20-39 0 22190
1 6680
40-59 0 24416
1 5656
59-79 0 15118
1 2692
80+ 0 2813
1 550
Name: no_show, dtype: int64
3) Does having a scholarship affect a patient showing up for his/her appointment?
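sch = pd.pivot_table(data=data, index=['scholarship'], columns='no_show', values=['age'], aggfunc=['count'])
print(sch)
sch.plot(kind='bar',color=['red','black'])
plt.legend(title='Status',labels=['Attended','Did not Attend']);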
count
age
no_show 0 1
scholarship
0 79924 19741
1 8283 2578
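Because far fewer patients have a scholarship, the counts above say little on their own. One way to compare the groups is a row-normalised cross-tabulation. The sketch below rebuilds one row per appointment from the four printed counts, so that pd.crosstab can be used the same way it would be on the real frame; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Rebuild one row per appointment from the four counts printed above
# keys are (scholarship, no_show)
cells = {(0, 0): 79924, (0, 1): 19741, (1, 0): 8283, (1, 1): 2578}
reps = list(cells.values())
rebuilt = pd.DataFrame({
    'scholarship': np.repeat([k[0] for k in cells], reps),
    'no_show': np.repeat([k[1] for k in cells], reps),
})

# normalize='index' turns each row of counts into proportions that sum to 1
rates = pd.crosstab(rebuilt['scholarship'], rebuilt['no_show'], normalize='index')
print(rates.round(3))
```

On these counts the no-show share is about 0.198 without a scholarship and about 0.237 with one.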
4) Does having a certain ailment affect if a patient will show up or not?
# One panel per condition; the plotted variable is passed as the keyword x
plt.subplot(2,2,1)
hypertension=data['hypertension'].map({0:'No',1:'Yes'})
sns.countplot(x=hypertension, data=data, hue='no_show', palette=['red','black'])
plt.title('Hypertension status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])
plt.subplot(2,2,2)
handicap=data['handicap']
sns.countplot(x=handicap, data=data, hue='no_show', palette=['red','black'])
plt.title('Handicap status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])
plt.subplot(2,2,3)
diabetes=data['diabetes'].map({0:'No',1:'Yes'})
sns.countplot(x=diabetes, data=data, hue='no_show', palette=['red','black'])
plt.title('Diabetes status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])
plt.subplot(2,2,4)
alcoholism=data['alcoholism'].map({0:'No',1:'Yes'})
sns.countplot(x=alcoholism, data=data, hue='no_show', palette=['red','black'])
plt.title('Alcoholism status')
plt.legend(title='Show',labels=['Attended','Did not Attend'])
plt.subplots_adjust(left=0,right=1.5,bottom=0,top=1.5,wspace=0.3,hspace=0.3)
pd.options.mode.chained_assignment = None
5) How often do men show up compared with women?
data.groupby('no_show').gender.value_counts().plot(kind='area', color=['red']);
plt.title('Gender That Keep To Appointment')
plt.xlabel('No-show status/Gender')
plt.show()
Other Findings
data.groupby('hypertension').gender.value_counts().plot(kind='barh',
color=['red']);
plt.title('Gender vs Hypertension')
plt.xlabel('Hypertension/Gender')
plt.show()
data.groupby('diabetes').gender.value_counts().plot(kind='barh',
color=['black']);
plt.title('Gender vs Diabetes')
plt.xlabel('Diabetes/Gender')
plt.show()
data.groupby('alcoholism').gender.value_counts().plot(kind='barh',
color=['red']);
plt.title('Gender vs Alcoholism')
plt.xlabel('Alcoholism/Gender')
plt.show()
data.groupby('handicap').gender.value_counts().plot(kind='barh',
color=['black']);
plt.title('Gender vs Handicap')
plt.xlabel('Handicap/Gender')
plt.show()
#data.groupby('gender')['hypertension'].mean().plot(kind='bar')
# the counts show that the majority of appointments did not receive an sms
sms = data.sms_received.value_counts()
print(sms)
0 75044
1 35482
Name: sms_received, dtype: int64
sms_received = pd.pivot_table(data=data, index=['sms_received'], columns='no_show', values=['age'], aggfunc=['count'])
print(sms_received)
sms_received.plot(kind='bar',color=['red','black'])
plt.legend(title='Status',labels=['Attended','did not attend']);
count
age
no_show 0 1
sms_received
0 62509 12535
1 25698 9784
# gender with highest health challenges
# hypertension
hyp = pd.pivot_table(data=data, index=['hypertension'], columns='gender', values=['age'], aggfunc=['count'])
print(hyp)
# diabetes
dia= pd.pivot_table(data=data,
index=['diabetes'],columns='gender',values=['age'],aggfunc=['count'])
print(dia)
dia.plot(kind='bar',stacked=True, color=['red','black']);
plt.legend(title='Status',labels=['Female','Male']);
# handicap
han= pd.pivot_table(data=data,
index=['handicap'],columns='gender',values=['age'],aggfunc=['count'])
print(han)
han.plot(kind='bar',stacked=True, color=['red','black']);
plt.legend(title='Status',labels=['Female','Male']);
# alcoholism
ach= pd.pivot_table(data=data,
index=['alcoholism'],columns='gender',values=['age'],aggfunc=['count']
)
print(ach)
ach.plot(kind='bar',stacked=True, color=['red','black']);
plt.legend(title='Status',labels=['Female','Male']);
count
age
gender F M
hypertension
0 56500 32225
1 15339 6462
count
age
gender F M
diabetes
0 66233 36350
1 5606 2337
count
age
gender F M
handicap
0 70549 37736
1 1181 861
2 105 78
3 3 10
4 1 2
count
age
gender F M
alcoholism
0 70616 36550
1 1223 2137
wait = data.query('waiting_days<10')
wait.groupby('waiting_days')['no_show'].value_counts().plot(kind='bar')
#wait=data.groupby('waiting_days').value_counts()
#plt.plot(wait);
<AxesSubplot:xlabel='waiting_days,no_show'>
Conclusion
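• The dataset records more female than male patients (71,839 against 38,687 appointments),
so females account for more attended appointments in absolute terms. The no-show rate is
nearly the same for both genders, at about 20%.
• There is a correlation of about 0.5 between age and hypertension. The majority of patients
between 0 and 60 years old showed up for their appointments.
• Another unexpected outcome is that patients without a scholarship showed up more often:
about 19.8% of them missed their appointment, against about 23.7% of patients with a
scholarship.
• The share of patients who did not receive an sms yet showed up is high, which wasn't
expected: about 16.7% of them missed their appointment, against about 27.6% of those who
received one.
• More women than men suffer from hypertension and diabetes, both in absolute numbers and
as a share of each gender.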
Limitations
• This dataset is insufficient to draw unbiased conclusions. It contains information for
just 3 months in 2016.
• The values of the 'No-show' column were confusing at first: 'No' means the patient
showed up and 'Yes' means the patient did not show up.
• There are outliers in age, including values below 0. Rows with an age below 0 were removed.