Emilio Garvin - Data Management
ASSIGNMENT
TECHNOLOGY PARK MALAYSIA
CT075-3-2-DTM
DATA MANAGEMENT
DATA PRE-PROCESSING
APD2F2209CSDA
WEIGHTAGE: 25%
TP NUMBER : TP062977
INSTRUCTIONS TO CANDIDATES:
2 Students are advised to underpin their answers with the use of references
(cited using the American Psychological Association (APA) Referencing).
CT075-3-2-DTM INDIVIDUAL ASSIGNMENT APD2F2200CSDA
Table of Contents
1. Introduction and Data Set
1.1 Sample of the Initial Data Set
1.2 Data Type
2. Exploratory Data Analysis (Before Pre-processing)
2.1 Descriptive statistics
2.1.1 Status
2.1.2 DeathCause
2.1.3 AgeCHDdiag
2.1.4 Sex
2.1.5 AgeAtStart
2.1.6 Height
2.1.7 Weight
2.1.8 Diastolic
2.1.9 Systolic
2.1.10 MRW
2.1.11 Smoking
2.1.12 AgeAtDeath
2.1.13 Cholesterol
2.1.14 Chol_Status
2.1.15 BP_Status
2.1.16 Weight_Status
2.1.17 Smoking_Status
2.2 Univariate graph for each column
2.2.1 Status
2.2.2 DeathCause
2.2.3 AgeCHDdiag
2.2.4 Sex
2.2.5 AgeAtStart
2.2.6 Height
2.2.7 Weight
2.2.8 Diastolic
2.2.9 Systolic
2.1.1 Status
2.1.2 DeathCause
2.1.3 AgeCHDdiag
2.1.4 Sex
2.1.5 AgeAtStart
2.1.6 Height
2.1.7 Weight
2.1.8 Diastolic
2.1.9 Systolic
2.1.10 MRW
2.1.11 Smoking
2.1.12 AgeAtDeath
2.1.13 Cholesterol
2.1.14 Chol_Status
2.1.15 BP_Status
2.1.16 Weight_Status
2.1.17 Smoking_Status
2.2.1 Status
The figure above shows that a larger percentage of subjects are alive than dead.
2.2.2 DeathCause
Because most subjects have a Status of Alive, the DeathCause column contains many missing
values: no cause of death exists for those subjects.
2.2.3 AgeCHDdiag
As seen above, most AgeCHDdiag values fall between roughly 50 and 70.
2.2.4 Sex
2.2.5 AgeAtStart
The histogram above shows that AgeAtStart varies widely, peaking at more than 400
occurrences between the ages of 30 and 40.
2.2.6 Height
The histogram of Height shows that most heights fall between 60 and 67.
2.2.7 Weight
Most Weight values lie between about 120 and 180, with the peak just below 150.
2.2.8 Diastolic
The Diastolic histogram above peaks just above 75, at almost 1200 occurrences.
2.2.9 Systolic
Figure 2.2-9 shows that Systolic peaks at around 900 occurrences, between approximately
120 and 140.
2.2.10 MRW
The MRW histogram peaks at more than 900 occurrences, in the range from 110 to 130.
2.2.11 Smoking
Because most records have a Smoking value of 0, non-smokers form the largest group in the
data.
2.2.12 AgeAtDeath
The histogram above shows that most deaths occurred between the ages of 60 and just over
80.
2.2.13 Cholesterol
Cholesterol values are most frequent between about 200 and approximately 270.
2.2.14 Chol_Status
Figure 2.2-14 shows that most subjects have a cholesterol status of either Borderline or
High.
2.2.15 BP_Status
The most common blood pressure status is High at 43.5%, followed by Normal at 41.1%.
2.2.16 Weight_Status
The bar chart above shows that the most common weight status is Overweight, at 3550
occurrences.
2.2.17 Smoking_Status
Non-smokers are the largest group, which is consistent with the histogram of Smoking.
3. Data Pre-processing
3.1 Incomplete Data
3.1.1 Showing the list of missing values in the dataset
The figures below show all the missing values in the dataset.
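A per-column count of missing values like the one listed in the figures can be sketched as follows. This is a minimal Python (pandas) illustration rather than the tool used to produce the figures; the rows are made-up sample data, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the report.
df = pd.DataFrame({
    "Status": ["Alive", "Dead", "Alive", "Dead"],
    "DeathCause": [np.nan, "Cancer", np.nan, "Other"],
    "Height": [62.5, np.nan, 66.0, 64.25],
    "Smoking": [0, 20, np.nan, 5],
})

# Number of missing values in each column.
missing_counts = df.isna().sum()

# Keep only the columns that actually contain missing values.
missing_report = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_report)
```

Columns with no missing values are dropped from the report so that only the columns needing treatment remain.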
Rows with the status “Alive” have no value in the DeathCause column, so the missing value
is replaced with “Alive”.
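The replacement described above could look like the sketch below in Python (pandas). The sample rows are made up, and restricting the fill to rows whose Status is "Alive" is an assumption added here so that a deceased subject with an unrecorded cause is not relabelled.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the report.
df = pd.DataFrame({
    "Status": ["Alive", "Dead", "Alive", "Dead"],
    "DeathCause": [np.nan, "Cancer", np.nan, np.nan],
})

# Fill DeathCause with "Alive" only where it is missing AND the subject is alive.
# The Status condition is an assumption of this sketch (see the lead-in).
mask = df["DeathCause"].isna() & df["Status"].eq("Alive")
df.loc[mask, "DeathCause"] = "Alive"
```

In this sample the last row stays missing, which makes any dead subject without a recorded cause visible for separate handling.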
3.1.3 Replacing the missing value in the Height by the mean and standard deviation
Figure 3.1-4 Replacing the missing values with the mean and standard deviation
For numerical data, each missing value is replaced with a value sampled from the interval
MEAN ± STANDARD DEVIATION (Bhandari, 2022). For approximately normally distributed data,
this interval covers about 68% of the values.
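A minimal sketch of sampling replacements from the mean ± standard deviation interval is shown below in Python. The report states only the interval, so drawing uniformly within it is an assumption, and the helper name `impute_within_one_sd`, the seed, and the sample heights are all hypothetical.

```python
import numpy as np
import pandas as pd

def impute_within_one_sd(series, seed=0):
    """Fill missing values with random draws from [mean - sd, mean + sd]."""
    rng = np.random.default_rng(seed)
    mean, sd = series.mean(), series.std()  # both skip missing values
    filled = series.copy()
    missing = filled.isna()
    # Uniform sampling is an assumption; the report only gives the interval.
    filled[missing] = rng.uniform(mean - sd, mean + sd, size=missing.sum())
    return filled

# Made-up heights with two missing entries.
height = pd.Series([62.5, 59.75, np.nan, 66.0, 64.25, np.nan, 70.0])
height_filled = impute_within_one_sd(height)
```

Fixing the seed makes the imputation reproducible, so the same filled values are obtained on every run.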
3.1.4 Replacing the missing value in Weight by the mean and standard deviation
Figure 3.1-7 Replacing the missing values with the mean and standard deviation
Missing Weight values are replaced using the same approach as for Height.
3.1.5 Replacing the missing value in MRW by the mean and standard deviation
Figure 3.1-9 Replacing the missing values with the mean and standard deviation
Missing MRW values receive the same treatment: they are replaced by sampling within the
mean ± standard deviation.
Because a missing age of CHD diagnosis is unknown, the missing values are replaced with 0,
which serves as a marker for unknown.
3.1.7 Replacing the missing value in Smoking with the Mode Value
For Smoking, the missing values are replaced with the most frequent value (the mode), which
is 0. The mode is used because the column has a high mean and standard deviation, so sampling
around the mean would produce widely varying values.
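Mode replacement can be sketched as below in Python (pandas). The sample values are made up, and taking the first mode when several values tie is an arbitrary choice made for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up Smoking values with two missing entries.
smoking = pd.Series([0, 0, 20, np.nan, 5, 0, np.nan, 30])

# mode() ignores missing values and may return several values on a tie;
# taking the first one is an arbitrary tie-break chosen for this sketch.
mode_value = smoking.mode().iloc[0]
smoking_filled = smoking.fillna(mode_value)
```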
3.1.8 Replacing the missing value in Smoking_Status with the Mode Value
Figure 3.1-15 Replacing the missing values based on the Smoking value
Now that the missing Smoking values have been fixed, the Smoking_Status can be fixed from
the Smoking value, following the ranges specified in the dataset.
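One way to derive a status from its numeric value without hard-coding cut-points is to read the ranges off the rows where both columns are present. The Python sketch below takes that approach as an assumption; the helper name `fill_status_from_value`, the sample rows, and the status labels in them are hypothetical.

```python
import pandas as pd

def fill_status_from_value(df, value_col, status_col):
    """Fill missing status labels using value ranges observed in complete rows."""
    out = df.copy()
    known = out.dropna(subset=[value_col, status_col])
    # Observed minimum and maximum of the value for each status label.
    ranges = known.groupby(status_col)[value_col].agg(["min", "max"])

    def lookup(value):
        for label, row in ranges.iterrows():
            if row["min"] <= value <= row["max"]:
                return label
        return None  # value falls outside every observed range

    missing = out[status_col].isna()
    out.loc[missing, status_col] = out.loc[missing, value_col].map(lookup)
    return out

# Made-up rows; the labels are illustrative only.
df = pd.DataFrame({
    "Smoking": [0, 0, 3, 5, 10, 20, 4, 0],
    "Smoking_Status": ["Non-smoker", "Non-smoker", "Light", "Light",
                       "Moderate", "Heavy", None, None],
})
fixed = fill_status_from_value(df, "Smoking", "Smoking_Status")
```

The same helper could be pointed at Cholesterol and Chol_Status, or at MRW and Weight_Status. A value that falls between two observed ranges is left missing rather than guessed.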
If the Status is ‘Alive’, the missing AgeAtDeath is replaced by 0; otherwise it is replaced
by the mode, which is 68.
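The two-branch rule above can be sketched in Python (pandas) as follows. The sample rows are made up, and the mode is recomputed from them rather than hard-coded to the 68 reported for the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the report.
df = pd.DataFrame({
    "Status": ["Alive", "Dead", "Dead", "Dead", "Alive", "Dead"],
    "AgeAtDeath": [np.nan, 68, 68, 75, np.nan, np.nan],
})

alive = df["Status"].eq("Alive")
missing = df["AgeAtDeath"].isna()

# Mode of the recorded ages at death, computed before any 0 markers are written.
mode_age = df["AgeAtDeath"].mode().iloc[0]

df.loc[missing & alive, "AgeAtDeath"] = 0          # 0 marks "still alive"
df.loc[missing & ~alive, "AgeAtDeath"] = mode_age  # deceased, age not recorded
```

Computing the mode before writing the 0 markers matters: otherwise 0 could itself become the most frequent value.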
3.1.10 Replacing the missing value in Cholesterol by the mean and standard deviation
Figure 3.1-19 Replacing the missing values with the mean and standard deviation
Missing Cholesterol values are likewise replaced by sampling within the mean ± standard
deviation.
Using the ranges shown in Figure 3.1-20, the missing Chol_Status values are filled in from
the Cholesterol value that was fixed in the previous section.
As with Chol_Status, the Weight_Status column is fixed by identifying the MRW ranges, which
directly determine the Weight_Status, and filling in the missing values from the MRW.
Since an AgeCHDdiag value of 0 means unknown, these values are excluded when plotting the
box plot. Because the outliers are not extreme, they are left in place.
The box plot shows no outliers, so the column is already clean.
3.2.3 Height
IQR = Q3 - Q1
(Outliers, n.d.)
This differs from the common 1.5 * IQR rule because the goal is to fix only the extreme
outliers. Q1 and Q3 are taken from the descriptive statistics table generated in the previous
section, where they are labelled 25th pctl and 75th pctl respectively.
Once the limits are set, each value beyond the upper or lower limit is replaced with that
limit value.
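The capping step can be sketched as below in Python (pandas). The report says only that the multiplier differs from 1.5, so `k = 3.0` (a common choice for extreme outliers) is an assumption, and the helper name `cap_extreme_outliers` and the sample weights are hypothetical.

```python
import pandas as pd

def cap_extreme_outliers(series, k=3.0):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those limits."""
    q1 = series.quantile(0.25)  # 25th percentile
    q3 = series.quantile(0.75)  # 75th percentile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr  # k = 3.0 is an assumption
    return series.clip(lower=lower, upper=upper), (lower, upper)

# Made-up weights with one extreme value.
weight = pd.Series([120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 400.0])
capped, (lower, upper) = cap_extreme_outliers(weight)
```

Clipping keeps every row and changes only the values beyond the limits, which matches the goal of altering as little of the original data as possible.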
3.2.4 Weight
3.2.5 Diastolic
3.2.6 Systolic
3.2.7 MRW
3.2.8 Smoking
Since a value of 0 means the person is alive, these values are excluded when producing the
box plot. No extreme outliers are detected in the figure above.
3.2.10 Cholesterol
3.2.11 Explanation
The new box plots still show some outliers. Because the goal of this section is to fix only
the extreme values and so preserve the original character of the data, some mild outliers
remain. These mild outliers are not expected to reduce the accuracy of the further analysis.
Since no attribute has a wrong data type and, as seen in Section 2.1, the dataset contains
no incorrect values, no changes are made in this part.
As seen above, the values in the column are not in a standard format: they are recorded
with differing numbers of decimal places. Moreover, among the comparable numerical columns
(Weight, MRW, etc.), only Height contains decimal values. Height is therefore standardized by
rounding it to the nearest integer.
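Rounding to the nearest integer can be sketched as below in Python. The sample heights are made up, and rounding ties upward is an assumption of this sketch, since the report does not say how halves are handled.

```python
import numpy as np
import pandas as pd

# Made-up heights, including a tie (62.5) and a missing value.
height = pd.Series([62.5, 59.75, 66.0, 64.25, np.nan])

# floor(x + 0.5) rounds ties upward (62.5 -> 63). Series.round() would instead
# round ties to the nearest even integer (62.5 -> 62).
height_rounded = np.floor(height + 0.5).astype("Int64")  # nullable integer dtype
```

The nullable `Int64` dtype keeps any remaining missing entries as missing instead of forcing the column back to floating point.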
3.4.2 Adding a new column for the range of Age CHD Diagnosed
For each Age attribute, a new column is created that shows the age range and status to
assist the analysis. The ages are grouped using the common age groups of Child, Young
Adults, Middle-aged Adults, and Senior.
As specified when fixing the missing values, a value of 0 is classified as “Unknown”.
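The grouping can be sketched as below in Python (pandas). Only the Senior boundary (above 50) is stated elsewhere in this report; the other cut-points, the helper name `age_group`, and the sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

def age_group(age):
    """Map ages to group labels; 0 is treated as the 'Unknown' marker."""
    # Hypothetical cut-points, except Senior (> 50), which the report states.
    bins = [0, 17, 30, 50, np.inf]
    labels = ["Child", "Young Adults", "Middle-aged Adults", "Senior"]
    # Intervals are right-closed: (0, 17], (17, 30], (30, 50], (50, inf).
    groups = pd.cut(age, bins=bins, labels=labels).astype(object)
    return groups.where(age != 0, "Unknown")

# Made-up ages, including the 0 marker and the boundary value 50.
df = pd.DataFrame({"AgeCHDdiag": [0, 12, 25, 45, 50, 51, 70]})
df["AgeCHDStat"] = age_group(df["AgeCHDdiag"])
```

Because the first interval starts just above 0, the 0 marker falls outside every bin and can be labelled separately.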
As specified when fixing the missing values, a value of 0 means the person is alive, hence
the label.
As seen from the bar chart above, apart from the Alive status, coronary heart disease is
the most common cause of death, with a count of 605.
4.3 AgeStartStat
As seen from the figure above, most subjects were Middle-aged Adults at the start, followed
by Senior and then Young Adults.
4.4 AgeCHDStat
As seen from the figure above, apart from the unknown data, CHD is mostly diagnosed in the
Senior age group, which is above 50.
4.5 AgeDeathStat
Death also occurs mostly in the Senior class (age above 50), with a count of 1928.
Figure 4.6-1 Bar Chart of Death Cause group by the Age of Death
After filtering out the subjects who are still alive, the leading cause of death for
Seniors is coronary heart disease, followed by Cancer. For Middle-aged Adults, the leading
category is other causes, with Cancer second.
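The counts behind a grouped bar chart like this can be sketched with a cross-tabulation in Python (pandas). The rows below are made up and far smaller than the real data; the column names follow the report.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the report.
df = pd.DataFrame({
    "DeathCause": ["Alive", "Cancer", "Coronary Heart Disease", "Other",
                   "Coronary Heart Disease", "Cancer"],
    "AgeDeathStat": ["Alive", "Senior", "Senior", "Middle-aged Adults",
                     "Senior", "Middle-aged Adults"],
})

# Drop the subjects who are still alive, then count causes per age group.
deceased = df[df["DeathCause"] != "Alive"]
counts = pd.crosstab(deceased["AgeDeathStat"], deceased["DeathCause"])

# Most frequent cause of death within each age group.
top_cause = counts.idxmax(axis=1)
print(counts)
```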
From the figure above, we can conclude that more than half of the people diagnosed with
CHD have died, mostly in the Senior age group at 814 counts.
As seen from the scatter plot, the Female data dominates the lower left of the plot, meaning
that females have smaller Height and Weight overall, while the Male data dominates the rest.
Even so, the largest Weight, almost 300, occurs at a Height between 60 and 65.
We can conclude that most of the people in the data are overweight, with females having the
highest count at 1909 overweight cases.
From the figure above, we can conclude that the higher the Diastolic and Systolic readings,
the higher the blood pressure status. Subjects with an optimal status have a Diastolic of
around 70 and a Systolic of approximately 100 to 110.
From the correlation table above, we can conclude that the factor most strongly correlated
with MRW is Weight at 0.76, while Height shows a weak negative correlation at -0.13. Weight
and Height have a moderate correlation of 0.51.
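A correlation table of this kind can be sketched in Python (pandas) as below. The values are made up, so the coefficients will not match those reported above, and Pearson correlation is an assumption because the report does not name the method.

```python
import pandas as pd

# Made-up sample values; only the column names come from the report.
df = pd.DataFrame({
    "Height": [60.0, 62.0, 64.0, 66.0, 68.0, 70.0],
    "Weight": [110.0, 150.0, 125.0, 170.0, 140.0, 190.0],
    "MRW":    [100.0, 130.0, 105.0, 135.0, 108.0, 140.0],
})

# Pairwise Pearson correlation coefficients between the numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Each coefficient lies between -1 and 1, and the table is symmetric, so only the entries above or below the diagonal need to be read.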
Figure 4.12-1 Scatter Plot between MRW and Weight Group by Sex
Male subjects have a higher overall Weight than Female subjects throughout the data. The
plot also shows that the higher the Weight, the higher the MRW.
Figure 4.13-1 Scatter plot of Height and Weight Group by Weight Status
The scatter plot above clearly separates the weight statuses, with Overweight being the
most frequent.
The heatmap above shows that the most frequent ages of death range from 60 to 80, with
cholesterol levels from 200 to just above 250.
Figure 4.15-1 Heatmap of Smoking level and Cholesterol compared to Age of Death
The heatmap above relates the age of death to the smoking level and the cholesterol range.
It suggests that the higher the smoking level, the higher the Cholesterol, and that when both
attributes are high, the age of death can fall below the average.
Figure 4.16-1 Relationship between blood pressure, Cholesterol, and MRW group by Sex
We can conclude from the figure above that the higher the MRW (and hence the weight status,
up to Overweight) and the Cholesterol, the higher the blood pressure. The graph also shows
slightly higher Cholesterol and MRW values in the female data.
5. Conclusion
This paper has shown the data set before and after pre-processing. The data were prepared
for a more accurate analysis by replacing the missing values and the extreme outliers found
in the numerical columns. The graphs plotted from the new data set reflect this more accurate
analysis. Overall, the paper has fulfilled all the requirements of the assignment.
6. References
Bhandari, P. (2022, September 26). How to Calculate the Standard Deviation | Formula, Meaning & Examples. Scribbr. https://www.scribbr.co.uk/stats/standard-deviation-meaning/
Outliers. (n.d.). https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/