Emilio Garvin - Data Management

CT075-3-2-DTM INDIVIDUAL ASSIGNMENT APD2F2200CSDA
ASSIGNMENT
TECHNOLOGY PARK MALAYSIA
CT075-3-2-DTM
DATA MANAGEMENT
DATA PRE-PROCESSING
APD2F2209CSDA
HAND OUT DATE: 10 OCTOBER 2022
HAND IN DATE: 30 DECEMBER 2022
WEIGHTAGE: 25%
NAME : Emilio Garvin
TP NUMBER : TP062977
INSTRUCTIONS TO CANDIDATES:
1 Submit your assignment at the administrative counter.
2 Students are advised to underpin their answers with the use of references (cited using the American Psychological Association (APA) Referencing).
3 Late submission will be awarded zero (0) unless Extenuating Circumstances (EC) are upheld.
4 Cases of plagiarism will be penalized.
5 The assignment should be bound in an appropriate style (comb bound or stapled).
6 Where the assignment should be submitted in both hardcopy and softcopy, the softcopy of the written assignment and source code (where appropriate) should be on a CD in an envelope / CD cover and attached to the hardcopy.
7 You must obtain 50% overall to pass this module.
Table of Contents
1. Introduction and Data Set
1.1 Sample of the Initial Data Set
1.2 Data Type
2. Exploratory Data Analysis (Before Pre-processing)
2.1 Descriptive statistics
2.1.1 Status
2.1.2 DeathCause
2.1.3 AgeCHDdiag
2.1.4 Sex
2.1.5 AgeAtStart
2.1.6 Height
2.1.7 Weight
2.1.8 Diastolic
2.1.9 Systolic
2.1.10 MRW
2.1.11 Smoking
2.1.12 AgeAtDeath
2.1.13 Cholesterol
2.1.14 Chol_Status
2.1.15 BP_Status
2.1.16 Weight_Status
2.1.17 Smoking_Status
2.2 Univariate graph for each column
2.2.1 Status
2.2.2 DeathCause
2.2.3 AgeCHDdiag
2.2.4 Sex
2.2.5 AgeAtStart
2.2.6 Height
2.2.7 Weight
2.2.8 Diastolic
2.2.9 Systolic
2.2.10 MRW
2.2.11 Smoking
2.2.12 AgeAtDeath
2.2.13 Cholesterol
2.2.14 Chol_Status
2.2.15 BP_Status
2.2.16 Weight_Status
2.2.17 Smoking_Status
3. Data Pre-processing
3.1 Incomplete Data
3.1.1 Showing the list of missing value in the dataset
3.1.2 Replacing the missing value in DeathCause
3.1.3 Replacing the missing value in the Height by the mean and standard deviation
3.1.4 Replacing the missing value in Weight by the mean and standard deviation
3.1.5 Replacing the missing value in MRW by the mean and standard deviation
3.1.6 Replacing the missing value in AgeCHDdiag with 0
3.1.7 Replacing the missing value in Smoking with the Mode Value
3.1.8 Replacing the missing value in Smoking_Status with the Mode Value
3.1.9 Replacing the missing value in AgeAtDeath with 0 if Alive
3.1.10 Replacing the missing value in Cholesterol by the mean and standard deviation
3.1.11 Replacing the missing value in Chol_Status based on the Cholesterol
3.1.12 Replacing the missing value in Weight_Status based on the MRW
3.1.13 Data with no missing value
3.2 Noisy Data
3.2.1 Age CHD Diagnosed Without the 0 Value
3.2.2 Age At Start
3.2.3 Height
3.2.4 Weight
3.2.5 Diastolic
3.2.6 Systolic
3.2.7 MRW
3.2.8 Smoking
3.2.9 Age At Death Without the 0 Value
3.2.10 Cholesterol
3.2.11 Explanation
3.3 Inconsistent Data
3.4 Data Transformation
3.4.1 Adding a unique ID for each row
3.4.2 Adding a new column for the range of Age CHD Diagnosed
3.4.3 Adding a new column for the range of Age At Start
3.4.4 Adding a new column for the range of Age At Death
4. Visualization and Analysis of the Processed Data
4.1 Final Data Set
4.2 DeathCause by Status
4.3 AgeStartStat
4.4 AgeCHDStat
4.5 AgeDeathStat
4.6 DeathCause group by AgeDeathStat
4.7 Number of patients with CHD diagnosed group by Status
4.8 Height and Weight Group By Sex
4.9 Weight Status Group by Sex
4.10 Systolic and Diastolic Group By Blood Pressure Status
4.11 Correlation between Height, Weight and MRW
4.12 Weight and MRW Group by Sex
4.13 Height and Weight Group by Weight Status
4.14 Cholesterol Level Compared to Age of Death
4.15 Smoking Level and Cholesterol Compared to Age of Death
4.16 Relationship between blood pressure, Cholesterol, and MRW group by Sex
5. Conclusion
6. References
1. Introduction and Data Set

This paper describes in detail the pre-processing and analysis of the Heart data set, using SAS Viya as the tool for both pre-processing and visualization.
Before each column is examined, a sample of the data set is shown and the data type of each column is identified.
1.1 Sample of the Initial Data Set
Figure 1.1-1 Sample of the data set
1.2 Data Type
2. Exploratory Data Analysis (Before Pre-processing)

2.1 Descriptive statistics
This section presents a descriptive statistics table for each column in the Heart dataset. The statistics reported depend on the data type.
• Categorical : Frequency, Percent, Cumulative Frequency, Cumulative Percent
• Numerical : N, N miss (number of missing values), Minimum, Maximum, Median, 25th percentile (Q1), 75th percentile (Q3), 95th percentile, Mode, Mean, Variance, and Standard Deviation.
2.1.1 Status
Figure 2.1-1 Descriptive Statistics of Status
2.1.2 DeathCause
Figure 2.1-2 Descriptive Statistics of Death Cause
2.1.3 AgeCHDdiag
Figure 2.1-3 Descriptive Statistics of Age CHD Diagnosed
2.1.4 Sex
Figure 2.1-4 Descriptive Statistics of Sex
2.1.5 AgeAtStart
Figure 2.1-5 Descriptive Statistics of Age at Start
2.1.6 Height
Figure 2.1-6 Descriptive Statistics of Height
2.1.7 Weight
Figure 2.1-7 Descriptive Statistics of Weight
2.1.8 Diastolic
Figure 2.1-8 Descriptive Statistics of Diastolic
2.1.9 Systolic
Figure 2.1-9 Descriptive Statistics of Systolic
2.1.10 MRW
Figure 2.1-10 Descriptive Statistics of MRW
2.1.11 Smoking
Figure 2.1-11 Descriptive Statistics of Smoking
2.1.12 AgeAtDeath
Figure 2.1-12 Descriptive Statistics of Age at Death
2.1.13 Cholesterol
Figure 2.1-13 Descriptive Statistics of Cholesterol
2.1.14 Chol_Status
Figure 2.1-14 Descriptive Statistics of Cholesterol Status
2.1.15 BP_Status
Figure 2.1-15 Descriptive Statistics of Blood Pressure Status
2.1.16 Weight_Status
Figure 2.1-16 Descriptive Statistics of Weight Status
2.1.17 Smoking_Status
Figure 2.1-17 Descriptive Statistics of Smoking Status
2.2 Univariate graph for each column

Because the data has not yet been processed, each column is first plotted on its own, with the missing values still included.
2.2.1 Status
Figure 2.2-1 Pie Chart of Status
The figure above shows that a larger share of the subjects are alive than dead.
2.2.2 DeathCause
Figure 2.2-2 Bar Chart of Death Cause
Because most subjects have the Status Alive, most values in the DeathCause column are missing: a living subject has no cause of death.
2.2.3 AgeCHDdiag
Figure 2.2-3 Histogram of Age CHD Diagnosed
As seen above, most AgeCHDdiag values fall between roughly 50 and 70.
2.2.4 Sex
Figure 2.2-4 Pie Chart of Sex
Females make up the majority of the data at 55.2%.
2.2.5 AgeAtStart
Figure 2.2-5 Histogram of Age at Start
The histogram above shows that the age at start varies widely, with a peak of more than 400 occurrences between the ages of 30 and 40.
2.2.6 Height
Figure 2.2-6 Histogram of Height
The histogram of Height shows that most heights fall between 60 and 67.
2.2.7 Weight
Figure 2.2-7 Histogram of Weight
Most Weight values fall between roughly 120 and 180, with the peak just below 150.
2.2.8 Diastolic
Figure 2.2-8 Histogram of Diastolic
The Diastolic histogram above peaks just above 75, at almost 1200 occurrences.
2.2.9 Systolic
Figure 2.2-9 Histogram of Systolic
Figure 2.2-9 shows that Systolic peaks at around 900 occurrences, between approximately 120 and 140.
2.2.10 MRW
Figure 2.2-10 Histogram of MRW
The histogram peaks at more than 900 occurrences in the range from 110 to 130.
2.2.11 Smoking
Figure 2.2-11 Histogram of Smoking
Because 0 is the most frequent Smoking value, non-smokers form the largest group in the data.
2.2.12 AgeAtDeath
Figure 2.2-12 Histogram of Age of Death
The histogram above shows that most deaths occur between the ages of 60 and just over 80.
2.2.13 Cholesterol
Figure 2.2-13 Histogram of Cholesterol
Cholesterol values are most frequent between around 200 and approximately 270.
2.2.14 Chol_Status
Figure 2.2-14 Bar Chart of Cholesterol Status
Figure 2.2-14 shows that most subjects have a cholesterol status of either borderline or high.
2.2.15 BP_Status
Figure 2.2-15 Pie Chart of Blood Pressure Status
The most common blood pressure status is high at 43.5%, followed by normal at 41.1%.
2.2.16 Weight_Status
Figure 2.2-16 Bar Chart of Weight Status
The bar chart above also shows that Overweight is the most common weight status, at 3550 occurrences.
2.2.17 Smoking_Status
Figure 2.2-17 Bar Chart of Smoking Status
Non-smokers are the largest group, consistent with the histogram of Smoking.
3. Data Pre-processing
3.1 Incomplete Data
3.1.1 Showing the list of missing value in the dataset
The figures below will show all the missing values in the dataset.
Figure 3.1-1 List of missing values
3.1.2 Replacing the missing value in DeathCause
Figure 3.1-2 Replacing missing value in DeathCause
Rows with the status “Alive” have no value in the DeathCause column, so the missing value is replaced with “Alive”.
Figure 3.1-3 The new frequency table of DeathCause
3.1.3 Replacing the missing value in the Height by the mean and standard deviation
Figure 3.1-4 Replacing the missing values with the mean and standard deviation
For numerical data, each missing value is replaced with a value sampled from the range MEAN ± STANDARD DEVIATION (Bhandari, 2022). For normally distributed data, this range covers around 68% of the values.
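The sampling step described above can be sketched in Python. This is an illustration only and not the SAS code shown in Figure 3.1-4: the function name `impute_mean_sd`, the uniform draw within the range, the fixed seed, and the toy height values are all assumptions made for the example.

```python
import random
import statistics

def impute_mean_sd(values, seed=0):
    """Replace None entries with a random draw from [mean - sd, mean + sd].

    Assumption: the draw is uniform within the range; the report does not
    state which distribution its SAS code samples from.
    """
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation
    rng = random.Random(seed)        # fixed seed so the result is repeatable
    return [
        v if v is not None else rng.uniform(mean - sd, mean + sd)
        for v in values
    ]

# Toy values for illustration, not taken from the Heart dataset.
heights = [62.5, 64.0, None, 66.25, 68.0, None, 65.5]
filled = impute_mean_sd(heights)
```

Observed values are left untouched; only the missing entries receive a sampled value, so the mean of the column stays close to the original.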
Figure 3.1-5 Example of Normal Distribution plot
Figure 3.1-6 The new descriptive table of Height
3.1.4 Replacing the missing value in Weight by the mean and standard deviation
The Weight column follows the same principle for replacing the missing values as Height.
Figure 3.1-8 The new descriptive table of weight
3.1.5 Replacing the missing value in MRW by the mean and standard deviation
Missing values in the MRW column are treated the same way: each is replaced with a value sampled from within one standard deviation of the mean.
Figure 3.1-10 The new descriptive table of MRW
3.1.6 Replacing the missing value in AgeCHDdiag with 0
Figure 3.1-11 Replacing the missing values with 0
Because a missing age of CHD diagnosis is unknown, it is replaced with 0, which serves as the marker for an unknown age.
Figure 3.1-12 The new descriptive table of AgeCHDdiag
3.1.7 Replacing the missing value in Smoking with the Mode Value
Figure 3.1-13 Replacing the missing values with the Mode
For Smoking, the missing values are replaced with the most frequent value, which is 0. The mode is used because the column has a high standard deviation around its mean, so sampling from the mean ± standard deviation range would produce widely varying values.
Figure 3.1-14 The new descriptive table of Smoking
3.1.8 Replacing the missing value in Smoking_Status with the Mode Value
Figure 3.1-15 Replacing the missing values based on the Smoking value
Since the missing values of Smoking have been fixed, we can now fix Smoking_Status based on the Smoking value, following the ranges specified in the dataset.
Figure 3.1-16 New descriptive table of Smoking
3.1.9 Replacing the missing value in AgeAtDeath with 0 if Alive
Figure 3.1-17 Replacing the missing value of 'Alive' with 0
If the status is ‘Alive’, the missing AgeAtDeath is replaced by 0; otherwise it is replaced by the mode, which is 68.
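The conditional rule above can be sketched in Python as follows. It is not the SAS code from Figure 3.1-17: the function name, the dictionary row layout, the status label "Dead", and the toy rows are assumptions chosen so that the mode of the known ages is 68, as in the report.

```python
import statistics

def impute_age_at_death(rows):
    """Fill missing AgeAtDeath: 0 for living subjects, the mode otherwise."""
    known = [r["AgeAtDeath"] for r in rows if r["AgeAtDeath"] is not None]
    mode_age = statistics.mode(known)  # most frequent known age at death
    filled = []
    for r in rows:
        age = r["AgeAtDeath"]
        if age is None:
            age = 0 if r["Status"] == "Alive" else mode_age
        filled.append({**r, "AgeAtDeath": age})
    return filled

# Toy rows for illustration; "Dead" is an assumed label for non-living subjects.
rows = [
    {"Status": "Dead", "AgeAtDeath": 68},
    {"Status": "Dead", "AgeAtDeath": 68},
    {"Status": "Dead", "AgeAtDeath": 72},
    {"Status": "Alive", "AgeAtDeath": None},
    {"Status": "Dead", "AgeAtDeath": None},
]
result = impute_age_at_death(rows)
```

Because 0 is a marker and not a real age, later steps need to exclude it from statistics and plots, which is what Section 3.2.9 does.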
Figure 3.1-18 New descriptive table of AgeAtDeath
3.1.10 Replacing the missing value in Cholesterol by the mean and standard deviation
The missing Cholesterol values are also replaced by sampling from the mean ± standard deviation range.
Figure 3.1-20 New descriptive table of Cholesterol
3.1.11 Replacing the missing value in Chol_Status based on the Cholesterol
Figure 3.1-21 Showing the range of Cholesterol Status
Figure 3.1-22 Replacing the data based on the range
Based on the ranges shown in Figure 3.1-21, the missing values are replaced according to the Cholesterol value, which was fixed in the previous section.
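A range-based fill of this kind can be sketched in Python. The cut points and the label "Desirable" in `CHOL_BANDS` are placeholders assumed for the example, since the actual ranges appear only in the figure; the function names are also hypothetical.

```python
def status_from_value(value, bands):
    """Return the label of the first band whose upper bound exceeds value."""
    for upper, label in bands:
        if value < upper:
            return label
    raise ValueError(f"no band covers {value}")

# Placeholder bands (exclusive upper bound, label); NOT read from the report.
CHOL_BANDS = [(200, "Desirable"), (240, "Borderline"), (float("inf"), "High")]

def fill_missing_status(rows, value_col, status_col, bands):
    """Fill only the missing statuses; existing statuses are kept as they are."""
    filled = []
    for r in rows:
        status = r[status_col]
        if status is None:
            status = status_from_value(r[value_col], bands)
        filled.append({**r, status_col: status})
    return filled

# Toy rows for illustration only.
rows = [
    {"Cholesterol": 250, "Chol_Status": None},
    {"Cholesterol": 210, "Chol_Status": None},
    {"Cholesterol": 180, "Chol_Status": "Desirable"},
]
fixed = fill_missing_status(rows, "Cholesterol", "Chol_Status", CHOL_BANDS)
```

The same helper could serve Weight_Status by passing MRW as the value column together with its own bands.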
Figure 3.1-23 New descriptive table of Chol_Status
3.1.12 Replacing the missing value in Weight_Status based on the MRW
Figure 3.1-24 MRW ranges between the Weight_Status
Figure 3.1-25 Replacing the data based on the range
Similar to Chol_Status, the Weight_Status column is fixed by identifying the MRW ranges that directly determine Weight_Status and then filling each missing status from its MRW value.
Figure 3.1-26 New descriptive table of Weight_Status
3.1.13 Data with no missing value
Figure 3.1-27 Data with no missing value
3.2 Noisy Data

The first step in detecting noisy data is to find the outliers in the numerical columns, which can be done by drawing a box plot for each of them. Here, only outliers with extreme values are treated; mild outliers are left in place.
3.2.1 Age CHD Diagnosed Without the 0 Value
Figure 3.2-1 Boxplot of AgeCHDdiag
Because a 0 in AgeCHDdiag means the age is unknown, those values are excluded from the box plot. The remaining outliers are not extreme, so none of them are removed.
3.2.2 Age At Start
Figure 3.2-2 Boxplot of AgeAtStart
The box plot shows no outliers, so the column is already clean.
3.2.3 Height
Figure 3.2-3 Boxplot of Height
Figure 3.2-4 Code to replace the outlier
Since only the extreme values are treated, the limits are calculated as
IQR = Q3 - Q1
Upper Limit = Q3 + 3 * IQR
Lower Limit = Q1 - 3 * IQR
(Outliers, n.d.)
This differs from the common 1.5 * IQR rule because the goal is to fix only the extreme outliers. Q1 and Q3 can be read from the descriptive statistics table generated in the previous section, where they are labelled 25th pctl and 75th pctl.
Any value beyond a limit is then replaced with the upper or lower limit value, whichever it exceeds.
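The limits and the replacement step can be sketched in Python as below. This is not the SAS code from Figure 3.2-4: the function name, the toy weights, and the use of `statistics.quantiles` with the "inclusive" method are assumptions, and SAS may compute percentiles with a different definition, so the quartiles can differ slightly.

```python
import statistics

def clamp_extreme_outliers(values, k=3.0):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest limit."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    clamped = [min(max(v, lower), upper) for v in values]
    return clamped, (lower, upper)

# Toy values for illustration, with one extreme value at the end.
weights = [100, 110, 120, 130, 140, 150, 160, 170, 400]
clamped, limits = clamp_extreme_outliers(weights)
# Here Q1 = 120 and Q3 = 160, so IQR = 40 and the limits are 0 and 280;
# the value 400 is capped at 280 and every other value is unchanged.
```

Setting `k=1.5` would give the common rule mentioned above, which would also cap the mild outliers.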
Figure 3.2-5 New boxplot of Height
3.2.4 Weight
Figure 3.2-6 Boxplot of Weight
Figure 3.2-7 Code to fix the outliers
Figure 3.2-8 New boxplot of weight
3.2.5 Diastolic
Figure 3.2-9 Boxplot of Diastolic
Figure 3.2-11 New boxplot of Diastolic
3.2.6 Systolic
Figure 3.2-12 Boxplot of Systolic
Figure 3.2-14 New boxplot of Systolic
3.2.7 MRW
Figure 3.2-15 Boxplot of MRW
Figure 3.2-17 New boxplot of MRW
3.2.8 Smoking
Figure 3.2-18 Boxplot of Smoking
Figure 3.2-20 New boxplot of Smoking
3.2.9 Age At Death Without the 0 Value
Figure 3.2-21 Boxplot of Age At Death
Because a 0 refers to a subject who is still alive, those values are excluded from the box plot. The figure above shows no extreme outliers.
3.2.10 Cholesterol
Figure 3.2-22 Boxplot of Cholesterol
Figure 3.2-24 New boxplot of Cholesterol
3.2.11 Explanation
The new box plots still show some outliers. This is expected: the goal of this section is to fix only the extreme values so that the data stays close to the original, so the mild outliers remain. These mild outliers are not expected to reduce the accuracy of the later analysis.
3.3 Inconsistent Data

Inconsistent data can be defined as data whose values do not follow a uniform format within a column, or whose data type is wrong.
Figure 3.3-1 No Attribute of the Wrong Data Type
No attribute has the wrong data type, and Section 2.1 shows no incorrect values in the dataset, so no changes are made to the data types or values.
The only format difference is in the Height column.
Figure 3.3-2 Height Column
As seen above, the values in the column follow no standard, with differing numbers of decimal places. Moreover, among the numerical columns of the same type, such as Weight and MRW, only Height contains decimal values. The column is therefore standardized by rounding Height to the nearest integer.
Figure 3.3-3 Changing the format of the column
Figure 3.3-4 New format of the Height Column
3.4 Data Transformation

3.4.1 Adding a unique ID for each row
Since there is no primary key in the data, a unique ID column is generated to tackle the issue.
Figure 3.4-1 Code to generate a unique ID
Figure 3.4-2 A new 'ID' column created
3.4.2 Adding a new column for the range of Age CHD Diagnosed
For all the Age attributes, a new column is created to show the range and status of the age to
assist the analysis. The age is then grouped based on the common age grouping of Child,
Young Adults, Middle-aged Adults, and Senior.
Figure 3.4-3 Code to generate a new column for AgeCHDStat
As specified when fixing the missing values, the 0 value is classified as “Unknown”.
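The grouping can be sketched in Python as below. Only the Senior boundary (above 50) is stated later in this report; the other cut points and the function name are hypothetical placeholders, because the actual code appears only in Figure 3.4-3.

```python
def age_group(age, zero_label="Unknown"):
    """Map an age to a group label; 0 is a marker, not a real age."""
    if age == 0:
        return zero_label            # "Unknown" for AgeCHDStat
    if age <= 12:                    # placeholder cut point (assumption)
        return "Child"
    if age <= 30:                    # placeholder cut point (assumption)
        return "Young Adults"
    if age <= 50:                    # Senior is described as above 50
        return "Middle-aged Adults"
    return "Senior"

# Toy ages for illustration only.
ages_chd = [0, 45, 63, 28]
age_chd_stat = [age_group(a) for a in ages_chd]
```

For the age at death, the same function could be called with `zero_label="Alive"`, since a 0 there marks a subject who is still alive.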
Figure 3.4-4 New column created for AgeCHDStat
3.4.3 Adding a new column for the range of Age At Start
Figure 3.4-5 Code to generate a new column for AgeAtStart
Figure 3.4-6 New column of AgeStartStat
3.4.4 Adding a new column for the range of Age At Death
Figure 3.4-7 Code to generate new column of AgeDeathStat
As specified while fixing the missing value, the 0 value refers to the person being alive,
hence the labeling.
Figure 3.4-8 New column of AgeDeathStat
4. Visualization and Analysis of the Processed Data

4.1 Final Data Set
After processing, the final data set is shown in the sample below.
Figure 4.1-1 Sample of final data set
4.2 DeathCause by Status
Figure 4.2-1 Bar Chart of DeathCause based on Status
The bar chart above shows that, apart from subjects with the Alive status, coronary heart disease is the most frequent cause of death, at 605 cases.
4.3 AgeStartStat
Figure 4.3-1 Bar Chart of AgeStartStat
The figure above shows that most subjects were Middle-aged Adults at the start, followed by Senior and then Young Adults.
4.4 AgeCHDStat
Figure 4.4-1 Bar Chart of AgeCHDStat
The figure above shows that, apart from the unknown values, CHD is mostly diagnosed in the Senior age group, which is above 50.
4.5 AgeDeathStat
Figure 4.5-1 Bar Chart of AgeDeathStat
Deaths also occur mostly in the Senior group (age above 50), with a count of 1928.
4.6 DeathCause group by AgeDeathStat
Figure 4.6-1 Bar Chart of Death Cause group by the Age of Death
With the subjects who are still alive filtered out, the leading cause of death among Senior subjects is coronary heart disease, followed by Cancer. Among Middle-aged Adults, other causes rank first and Cancer second.
4.7 Number of patients with CHD diagnosed group by Status
Figure 4.7-1 Number of patients with CHD diagnosed group by Status
The figure above shows that more than half of the people diagnosed with CHD have died, most of them in the Senior age group (814 cases).
4.8 Height and Weight Group By Sex
Figure 4.8-1 Scatter plot of Height and Weight Group by Sex
In the scatter plot, the Female points dominate the lower left, meaning that females have smaller Height and Weight overall, while the Male points dominate the rest of the plot. Even so, the largest Weight, at almost 300, occurs at a Height between 60 and 65.
4.9 Weight Status Group by Sex
Figure 4.9-1 Bar Chart of Weight Status Group by Sex
Most people in the data are overweight, with females having the highest count at 1909 overweight cases.
4.10 Systolic and Diastolic Group By Blood Pressure Status
Figure 4.10-1 Systolic and Diastolic Group by Blood Pressure Status
The figure above shows that higher Diastolic and Systolic values correspond to a higher Blood Pressure Status. For the optimal status, Diastolic is around 70 and Systolic is approximately 100 to 110.
4.11 Correlation between Height, Weight and MRW
Figure 4.11-1 Correlation between Height, Weight, and MRW
The correlation table above shows that Weight is the factor most strongly correlated with MRW at 0.76, while Height shows a weak negative correlation with MRW at -0.13. Weight and Height are moderately correlated at 0.51.
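For reference, a coefficient of this kind can be computed as in the Python sketch below. It assumes the table reports Pearson correlation coefficients, which the report does not state; the function name and the toy values are illustrative and do not reproduce the figures above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Toy values for illustration, not taken from the Heart dataset.
weight = [120, 140, 160, 180, 200]
mrw = [95, 100, 118, 125, 140]
r = pearson(weight, mrw)
```

The result always lies between -1 and 1: values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.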
4.12 Weight and MRW Group by Sex
Figure 4.12-1 Scatter Plot between MRW and Weight Group by Sex
Males have a higher Weight overall than females throughout the data. The plot also shows that MRW increases with Weight.
4.13 Height and Weight Group by Weight Status
Figure 4.13-1 Scatter plot of Height and Weight Group by Weight Status
The scatter plot above clearly separates the weight status groups, with Overweight being the most frequent.
4.14 Cholesterol Level Compared to Age of Death
Figure 4.14-1 Heatmap of Cholesterol level compared to Age of Death
The heatmap above shows that the most frequent age of death ranges from 60 to 80, with cholesterol levels from 200 to just above 250.
4.15 Smoking Level and Cholesterol Compared to Age of Death
Figure 4.15-1 Heatmap of Smoking level and Cholesterol compared to Age of Death
The heatmap above shows the ranges of age, smoking, and cholesterol. Higher smoking levels are associated with higher Cholesterol, and when both attributes are high, the age of death can be lower than the average.
4.16 Relationship between blood pressure, Cholesterol, and MRW group by Sex
Figure 4.16-1 Relationship between blood pressure, Cholesterol, and MRW group by Sex
The figure above indicates that blood pressure rises with both MRW (a higher MRW means a higher weight status, which can lead to overweight) and Cholesterol. The graph also shows slightly higher Cholesterol and MRW values in the female data.
5. Conclusion
This paper has shown the Heart data set before and after pre-processing. Missing values were replaced and extreme outliers in the numerical columns were capped so that the data supports a more accurate analysis. That analysis is presented through the graphs plotted from the new data set. Overall, the paper has fulfilled all the requirements of the assignment.
6. References
Bhandari, P. (2022, September 26). How to Calculate the Standard Deviation | Formula, Meaning & Examples. Scribbr. https://www.scribbr.co.uk/stats/standard-deviation-meaning/
Outliers. (n.d.). https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/
