Assignment 05-02


Assignment 05 – 02 Cloud Computing Student No: N01604682

1)

I’ve selected a dataset used to predict whether a person will develop a heart condition over a
span of 10 years, based on several different features (independent variables).
The dataset has 16 columns in total, a few of which contain missing values.
0 male – This is an independent variable with binary values which depicts whether the individual
is a male or not.
1 age – Age feature is an independent variable which displays the current age of the individual.
2 education – This column denotes the education level of the individual.
3 currentSmoker – This feature has binary values and tells whether the person is a current
smoker or not.
4 cigsPerDay – This column denotes the number of cigarettes the individual smokes in one day.
5 BPMeds – This feature again uses binary values to denote whether the person takes any
medication for blood pressure.
6 prevalentStroke – This feature tells us whether the person has had a stroke in
the past. This feature also has binary values.
7 prevalentHyp and 8 diabetes – These independent variables have binary values and indicate
whether the individual has a history of hypertension and diabetes, respectively.
9 totChol – This attribute of the dataset helps to understand the total cholesterol level of the
person.
10 sysBP and 11 diaBP – These independent features denote the systolic and diastolic blood
pressure levels of the individual.
12 BMI, 13 heartRate and 14 glucose – These columns of the dataset denote the Body Mass
Index (BMI) of the person, their heartrate and their glucose levels.
15 TenYearCHD – This is the dependent variable, whose value depends on the independent
variables mentioned above. It is binary: 0 means the person will most likely not develop a
heart condition, and 1 means the person will most likely develop a heart condition.

2)

From the above summary, it is clear that only 1.01% of the data is missing and that, overall,
the data appears to contain no duplicate values.
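
For reference, a minimal sketch of how figures like these could be computed. It assumes pandas, which this report does not name, and uses a small made-up table in place of the real dataset. It also assumes the missing percentage is taken over all cells rather than over rows.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "BMI": [26.97, np.nan, 25.34, 28.58],
    "glucose": [77.0, 76.0, np.nan, 103.0],
    "TenYearCHD": [0, 0, 0, 1],
})

# Share of missing cells across the whole table, as a percentage.
missing_pct = df.isna().sum().sum() / df.size * 100

# Number of rows that are exact duplicates of an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(f"missing: {missing_pct:.2f}%, duplicate rows: {duplicate_rows}")
```

On the real data, the same two expressions would be applied to the loaded table instead of the toy frame.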
3)

From my analysis of the dataset, the education column did not appear to contribute much to
predicting the final dependent variable, so I dropped the column “education” from the
dataset.
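
Dropping the column could look like the following sketch. It assumes pandas and uses a made-up three-column table; only the column name "education" comes from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "age": [39, 46],
    "education": [4.0, 2.0],
    "TenYearCHD": [0, 0],
})

# Remove the column by name; the result is assigned back to df.
df = df.drop(columns=["education"])

print(list(df.columns))
```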

4)
In the next step, a few values were missing in the columns BMI, cigsPerDay, currentSmoker,
totChol, glucose and heartRate. Since these values are important in determining the final
dependent variable, the approach taken here is to impute the column mean in place of the
missing values.
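
A minimal sketch of mean imputation, assuming pandas and a made-up two-column table. Each missing value is replaced by the mean of its own column, computed from the values that are present.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "BMI": [26.0, np.nan, 28.0],
    "totChol": [200.0, 240.0, np.nan],
})

# Columns to impute; on the real data this list would hold the affected columns.
cols = ["BMI", "totChol"]

# mean() skips NaN by default, and fillna matches the means to columns by name.
df[cols] = df[cols].fillna(df[cols].mean())

print(df)
```

In this toy table the missing BMI becomes 27.0 and the missing totChol becomes 220.0.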

5)

At this step, a few values were still missing in the BPMeds column. Since this column is in
binary format, the approach here is to fill the missing entries with the approximate
mean of the column.
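
One way to read "approximate mean" for a binary column is the mean rounded to the nearest integer, so that the filled value is still 0 or 1. That reading is an assumption, as are pandas and the made-up values in this sketch.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real BPMeds column (values are made up).
df = pd.DataFrame({"BPMeds": [0.0, 0.0, 1.0, np.nan, 0.0]})

# Mean of the observed values is 0.25; rounding keeps the fill value binary.
fill_value = int(round(float(df["BPMeds"].mean())))

df["BPMeds"] = df["BPMeds"].fillna(fill_value)

print(fill_value, df["BPMeds"].tolist())
```

For a 0/1 column, rounding the mean gives the more frequent value whenever one value is in the majority, so this behaves like filling with the mode.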
6)
Finally, it can be seen that the data is clean and does not contain any missing values, which
will help generate accurate results when actually working with the data.
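
A final check of this kind could be sketched as below, again assuming pandas and using a made-up table in place of the cleaned dataset.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (values are made up).
df = pd.DataFrame({
    "age": [39, 46, 48],
    "BPMeds": [0.0, 0.0, 1.0],
    "TenYearCHD": [0, 0, 1],
})

# Count of missing values per column; every entry should be zero after cleaning.
missing_per_column = df.isna().sum()
print(missing_per_column)

# True only if no cell in the table is missing.
is_clean = not df.isna().any().any()
```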
