Emilio Garvin - Data Management

CT075-3-2-DTM INDIVIDUAL ASSIGNMENT APD2F2200CSDA

ASSIGNMENT
TECHNOLOGY PARK MALAYSIA

CT075-3-2-DTM
DATA MANAGEMENT
DATA PRE-PROCESSING

APD2F2209CSDA

HAND OUT DATE: 10 OCTOBER 2022

HAND IN DATE: 30 DECEMBER 2022

WEIGHTAGE: 25%

NAME : Emilio Garvin

TP NUMBER : TP062977

INSTRUCTIONS TO CANDIDATES:

1 Submit your assignment at the administrative counter.

2 Students are advised to underpin their answers with the use of references
(cited using the American Psychological Association (APA) Referencing).

3 Late submission will be awarded zero (0) unless Extenuating
Circumstances (EC) are upheld.

4 Cases of plagiarism will be penalized.

5 The assignment should be bound in an appropriate style (comb bound or
stapled).

6 Where the assignment should be submitted in both hardcopy and
softcopy, the softcopy of the written assignment and source code (where
appropriate) should be on a CD in an envelope / CD cover and attached to
the hardcopy.

7 You must obtain 50% overall to pass this module.

Table of Contents
1. Introduction and Data Set
1.1 Sample of the Initial Data Set
1.2 Data Type
2. Exploratory Data Analysis (Before Pre-processing)
2.1 Descriptive statistics
2.1.1 Status
2.1.2 DeathCause
2.1.3 AgeCHDdiag
2.1.4 Sex
2.1.5 AgeAtStart
2.1.6 Height
2.1.7 Weight
2.1.8 Diastolic
2.1.9 Systolic
2.1.10 MRW
2.1.11 Smoking
2.1.12 AgeAtDeath
2.1.13 Cholesterol
2.1.14 Chol_Status
2.1.15 BP_Status
2.1.16 Weight_Status
2.1.17 Smoking_Status
2.2 Univariate graph for each column
2.2.1 Status
2.2.2 DeathCause
2.2.3 AgeCHDdiag
2.2.4 Sex
2.2.5 AgeAtStart
2.2.6 Height
2.2.7 Weight
2.2.8 Diastolic
2.2.9 Systolic
2.2.10 MRW
2.2.11 Smoking
2.2.12 AgeAtDeath
2.2.13 Cholesterol
2.2.14 Chol_Status
2.2.15 BP_Status
2.2.16 Weight_Status
2.2.17 Smoking_Status
3. Data Pre-processing
3.1 Incomplete Data
3.1.1 Showing the list of missing value in the dataset
3.1.2 Replacing the missing value in DeathCause
3.1.3 Replacing the missing value in the Height by the mean and standard deviation
3.1.4 Replacing the missing value in Weight by the mean and standard deviation
3.1.5 Replacing the missing value in MRW by the mean and standard deviation
3.1.6 Replacing the missing value in AgeCHDdiag with 0
3.1.7 Replacing the missing value in Smoking with the Mode Value
3.1.8 Replacing the missing value in Smoking_Status with the Mode Value
3.1.9 Replacing the missing value in AgeAtDeath with 0 if Alive
3.1.10 Replacing the missing value in Cholesterol by the mean and standard deviation
3.1.11 Replacing the missing value in Chol_Status based on the Cholesterol
3.1.12 Replacing the missing value in Weight_Status based on the MRW
3.1.13 Data with no missing value
3.2 Noisy Data
3.2.1 Age CHD Diagnosed Without the 0 Value
3.2.2 Age At Start
3.2.3 Height
3.2.4 Weight
3.2.5 Diastolic
3.2.6 Systolic
3.2.7 MRW
3.2.8 Smoking
3.2.9 Age At Death Without the 0 Value
3.2.10 Cholesterol
3.2.11 Explanation
3.3 Inconsistent Data
3.4 Data Transformation
3.4.1 Adding a unique ID for each row
3.4.2 Adding a new column for the range of Age CHD Diagnosed
3.4.3 Adding a new column for the range of Age At Start
3.4.4 Adding a new column for the range of Age At Death
4. Visualization and Analysis of the Processed Data
4.1 Final Data Set
4.2 DeathCause by Status
4.3 AgeStartStat
4.4 AgeCHDStat
4.5 AgeDeathStat
4.6 DeathCause group by AgeDeathStat
4.7 Number of patients with CHD diagnosed group by Status
4.8 Height and Weight Group By Sex
4.9 Weight Status Group by Sex
4.10 Systolic and Diastolic Group By Blood Pressure Status
4.11 Correlation between Height, Weight and MRW
4.12 Weight and MRW Group by Sex
4.13 Height and Weight Group by Weight Status
4.14 Cholesterol Level Compared to Age of Death
4.15 Smoking Level and Cholesterol Compared to Age of Death
4.16 Relationship between blood pressure, Cholesterol, and MRW group by Sex
5. Conclusion
6. References

1. Introduction and Data Set


This report details the pre-processing and analysis of the Heart data set, using SAS Viya
as the tool for both pre-processing and visualization.
Before each column is examined, a sample of the data set is shown and the data type of
each column is identified.

1.1 Sample of the Initial Data Set

Figure 1.1-1 Sample of the data set

1.2 Data Type

2. Exploratory Data Analysis (Before Pre-processing)


2.1 Descriptive statistics
This section presents a descriptive statistics table for each column in the Heart
dataset. The statistics reported depend on the type of data:

• Categorical : Frequency, Percent, Cumulative Frequency, Cumulative Percent

• Numerical : N, N miss (number of missing values), Minimum, Maximum, Median, 25th
percentile (Q1), 75th percentile (Q3), 95th percentile, Mode, Mean, Variance, and
Standard Deviation.
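
The statistics listed above can be reproduced outside SAS Viya for a quick cross-check. The following is only an illustrative Python sketch, not the code used in this report: the function names and the sample values are hypothetical, and the percentile method used here may give slightly different quartiles from the SAS output.

```python
import statistics
from collections import Counter


def describe_numeric(values):
    """Summarise a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # "inclusive" interpolation is an assumption; SAS may use another definition.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    p95 = statistics.quantiles(present, n=20, method="inclusive")[-1]
    return {
        "n": len(present),
        "n_miss": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "p95": p95,
        "mode": statistics.mode(present),
        "mean": statistics.mean(present),
        "variance": statistics.variance(present),  # sample variance (n - 1)
        "std_dev": statistics.stdev(present),
    }


def describe_categorical(values):
    """Rows of (level, frequency, percent, cumulative frequency, cumulative percent)."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    rows, running = [], 0
    for level in sorted(counts):
        running += counts[level]
        rows.append((level, counts[level], 100 * counts[level] / total,
                     running, 100 * running / total))
    return rows


heights = [62.5, 64.0, None, 66.0, 64.0, 70.0]  # made-up values, not from the dataset
summary = describe_numeric(heights)
```

Missing values are counted separately and excluded from every other statistic, which matches how N and N miss are reported in the tables that follow.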

2.1.1 Status

Figure 2.1-1 Descriptive Statistics of Status

2.1.2 DeathCause

Figure 2.1-2 Descriptive Statistics of Death Cause

2.1.3 AgeCHDdiag

Figure 2.1-3 Descriptive Statistics of Age CHD Diagnosed

2.1.4 Sex

Figure 2.1-4 Descriptive Statistics of Sex

2.1.5 AgeAtStart

Figure 2.1-5 Descriptive Statistics of Age at Start

2.1.6 Height

Figure 2.1-6 Descriptive Statistics of Height

2.1.7 Weight

Figure 2.1-7 Descriptive Statistics of Weight

2.1.8 Diastolic

Figure 2.1-8 Descriptive Statistics of Diastolic

2.1.9 Systolic

Figure 2.1-9 Descriptive Statistics of Systolic

2.1.10 MRW

Figure 2.1-10 Descriptive Statistics of MRW

2.1.11 Smoking

Figure 2.1-11 Descriptive Statistics of Smoking

2.1.12 AgeAtDeath

Figure 2.1-12 Descriptive Statistics of Age at Death

2.1.13 Cholesterol

Figure 2.1-13 Descriptive Statistics of Cholesterol

2.1.14 Chol_Status

Figure 2.1-14 Descriptive Statistics of Cholesterol Status

2.1.15 BP_Status

Figure 2.1-15 Descriptive Statistics of Blood Pressure Status

2.1.16 Weight_Status

Figure 2.1-16 Descriptive Statistics of Weight Status

2.1.17 Smoking_Status

Figure 2.1-17 Descriptive Statistics of Smoking Status

2.2 Univariate graph for each column


Because the data has not yet been processed, each column is first plotted on its own
(univariate analysis) in its raw state, with the missing values still included.

2.2.1 Status

Figure 2.2-1 Pie Chart of Status

The figure above shows that a larger percentage of the subjects are alive than dead.

2.2.2 DeathCause

Figure 2.2-2 Bar Chart of Death Cause

Since most subjects have the Status Alive, the DeathCause column is mostly missing,
as those subjects have no cause of death.

2.2.3 AgeCHDdiag

Figure 2.2-3 Histogram of Age CHD Diagnosed

As seen above, most of the values fall between roughly 50 and 70.

2.2.4 Sex

Figure 2.2-4 Pie Chart of Sex

Most of the subjects in the data are Female, at 55.2%.

2.2.5 AgeAtStart

Figure 2.2-5 Histogram of Age at Start

The histogram above shows that the ages vary widely, with a peak of more than
400 occurrences between the ages of 30 and 40.

2.2.6 Height

Figure 2.2-6 Histogram of Height

The histogram of Height shows that most heights fall between 60 and 67.

2.2.7 Weight

Figure 2.2-7 Histogram of Weight

Most of the Weight values lie between about 120 and 180, with the peak just below
150.

2.2.8 Diastolic

Figure 2.2-8 Histogram of Diastolic

The peak of the Diastolic column is just above 75, at almost 1200
occurrences.

2.2.9 Systolic

Figure 2.2-9 Histogram of Systolic

From Figure 2.2-9, the peak of Systolic, at around 900 occurrences, lies between
approximately 120 and 140.

2.2.10 MRW

Figure 2.2-10 Histogram of MRW

The peak of the histogram, at more than 900 occurrences, spans 110 to
130.

2.2.11 Smoking

Figure 2.2-11 Histogram of Smoking

Since 0 is the most common Smoking value, non-smokers form the largest group in
the data.

2.2.12 AgeAtDeath

Figure 2.2-12 Histogram of Age of Death

The histogram above shows that most deaths occurred between the ages of 60 and
just over 80.

2.2.13 Cholesterol

Figure 2.2-13 Histogram of Cholesterol

Most cholesterol values lie between 200 and approximately 270.

2.2.14 Chol_Status

Figure 2.2-14 Bar Chart of Cholesterol Status

From Figure 2.2-14, the cholesterol status of most subjects is either Borderline or
High.

2.2.15 BP_Status

Figure 2.2-15 Pie Chart of Blood Pressure Status

The most common blood pressure status is High at 43.5%, followed by Normal
at 41.1%.

2.2.16 Weight_Status

Figure 2.2-16 Bar Chart of Weight Status

The bar chart above also shows that the most common weight status is
Overweight, at 3550 occurrences.

2.2.17 Smoking_Status

Figure 2.2-17 Bar Chart of Smoking Status

Non-smokers are the largest group, consistent with the histogram of
Smoking.

3. Data Pre-processing
3.1 Incomplete Data
3.1.1 Showing the list of missing value in the dataset
The figures below list all the missing values in the dataset.

Figure 3.1-1 List of missing values

3.1.2 Replacing the missing value in DeathCause

Figure 3.1-2 Replacing missing value in DeathCause

Rows with the status “Alive” have no value in the DeathCause column, so the
missing value is replaced with “Alive”.
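
As an illustration only, the rule can be sketched in Python as follows. The function name and the dictionary-based row layout are hypothetical, and the explicit check on Status is an assumption; the code actually used is the one shown in Figure 3.1-2.

```python
def fill_death_cause(row):
    """Return a copy of the row with a missing DeathCause filled for living subjects."""
    filled = dict(row)
    # Assumption: only rows whose Status is "Alive" receive the placeholder.
    if filled.get("DeathCause") is None and filled.get("Status") == "Alive":
        filled["DeathCause"] = "Alive"
    return filled
```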

Figure 3.1-3 The new frequency table of DeathCause

3.1.3 Replacing the missing value in the Height by the mean and standard
deviation

Figure 3.1-4 Replacing the missing values with the mean and standard deviation

For numerical data, each missing value is replaced with a value sampled from the range
MEAN ± STANDARD DEVIATION (Bhandari, 2022). For normally distributed data, about 68% of
the values fall within this range.
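
A minimal Python sketch of this kind of imputation is given below for illustration. It assumes the replacement is drawn uniformly from the interval [mean - SD, mean + SD]; the function name, the seed parameter, and the uniform draw are assumptions and may differ from the SAS code shown in the figure.

```python
import random
import statistics


def impute_mean_sd(values, seed=0):
    """Fill None entries with draws from [mean - sd, mean + sd] of the observed values."""
    rng = random.Random(seed)  # fixed seed so that the result is repeatable
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    return [rng.uniform(mean - sd, mean + sd) if v is None else v
            for v in values]
```

Observed values are left untouched; only the missing entries receive a sampled value, so the filled column keeps a spread similar to the original instead of piling up at a single number.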

Figure 3.1-5 Example of Normal Distribution plot

Figure 3.1-6 The new descriptive table of Height

3.1.4 Replacing the missing value in Weight by the mean and standard
deviation

Figure 3.1-7 Replacing the missing values with the mean and standard deviation

Missing values in Weight are replaced in the same way as for Height.

Figure 3.1-8 The new descriptive table of weight

3.1.5 Replacing the missing value in MRW by the mean and standard
deviation

Figure 3.1-9 Replacing the missing values with the mean and standard deviation

The missing values in the MRW column are treated the same way, by replacing them
with values sampled from the mean ± standard deviation range.

Figure 3.1-10 The new descriptive table of MRW

3.1.6 Replacing the missing value in AgeCHDdiag with 0

Figure 3.1-11 Replacing the missing values with 0

Since a missing age of CHD diagnosis is unknown, the missing values are replaced with 0,
which marks the age as unknown.

Figure 3.1-12 The new descriptive table of AgeCHDdiag

3.1.7 Replacing the missing value in Smoking with the Mode Value

Figure 3.1-13 Replacing the missing values with the Mode

For Smoking, the missing values are replaced with the most frequent value, which is
0. The mode is used because the column has a high mean and standard deviation, so
sampling around the mean would produce widely varying values.
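
Mode imputation can be sketched in Python as below. This is only an illustration with a hypothetical function name, not the code from the figure.

```python
import statistics


def impute_mode(values):
    """Fill None entries with the most frequent observed value."""
    present = [v for v in values if v is not None]
    mode = statistics.mode(present)
    return [mode if v is None else v for v in values]
```

For a column such as [0, 20, None, 0, 5, None], the mode is 0, so both missing entries become 0.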

Figure 3.1-14 The new descriptive table of Smoking

3.1.8 Replacing the missing value in Smoking_Status with the Mode Value

Figure 3.1-15 Replacing the missing values based on the Smoking value

Since the missing values of Smoking have been fixed, we can now fix
Smoking_Status based on the value of Smoking, following the ranges specified in the
dataset.
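
The mapping from Smoking to Smoking_Status can be sketched as a simple range lookup. The category labels and cut-offs below are assumptions about how the dataset labels its ranges and should be checked against the actual Smoking_Status values; the function name is hypothetical.

```python
def smoking_status(cigarettes_per_day):
    """Derive a Smoking_Status label from the Smoking value (assumed ranges)."""
    if cigarettes_per_day is None:
        return None
    if cigarettes_per_day == 0:
        return "Non-smoker"
    if cigarettes_per_day <= 5:
        return "Light (1-5)"
    if cigarettes_per_day <= 15:
        return "Moderate (6-15)"
    if cigarettes_per_day <= 25:
        return "Heavy (16-25)"
    return "Very Heavy (> 25)"
```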

Figure 3.1-16 New descriptive table of Smoking

3.1.9 Replacing the missing value in AgeAtDeath with 0 if Alive

Figure 3.1-17 Replacing the missing value of 'Alive' with 0

If the status is ‘Alive’, the missing AgeAtDeath is replaced with 0; otherwise it is
replaced with the mode, which is 68.
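
This conditional rule can be sketched in Python as follows. The function name and row layout are hypothetical; the default of 68 is the mode reported above.

```python
def fill_age_at_death(row, mode_age=68):
    """Fill a missing AgeAtDeath: 0 for living subjects, otherwise the mode."""
    filled = dict(row)
    if filled.get("AgeAtDeath") is None:
        filled["AgeAtDeath"] = 0 if filled.get("Status") == "Alive" else mode_age
    return filled
```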

Figure 3.1-18 New descriptive table of AgeAtDeath

3.1.10 Replacing the missing value in Cholesterol by the mean and standard
deviation

Figure 3.1-19 Replacing the missing values with the mean and standard deviation

The missing Cholesterol values are also replaced by sampling from the mean ± standard
deviation range.

Figure 3.1-20 New descriptive table of Cholesterol

3.1.11 Replacing the missing value in Chol_Status based on the Cholesterol

Figure 3.1-21 Showing the range of Cholesterol Status

Figure 3.1-22 Replacing the data based on the range

Based on the ranges shown in Figure 3.1-21, we can replace the missing Chol_Status values
using the Cholesterol values that were fixed in the previous section.
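
A range-based fill of this kind can be sketched with a sorted list of cut-offs. The cut-off values (200 and 240) and the label "Desirable" are assumptions used for illustration; the ranges actually applied are the ones read from the figure.

```python
from bisect import bisect_right

# Assumed cut-offs: below 200, 200 to 239, and 240 or above.
CHOL_CUTS = [200, 240]
CHOL_LABELS = ["Desirable", "Borderline", "High"]


def chol_status(cholesterol):
    """Map a Cholesterol value to a Chol_Status label using the assumed cut-offs."""
    return CHOL_LABELS[bisect_right(CHOL_CUTS, cholesterol)]
```

The same lookup pattern applies to any status column that is derived from a numeric column, as long as the list of cut-offs and the list of labels are swapped for the appropriate ones.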

Figure 3.1-23 New descriptive table of Chol_Status

3.1.12 Replacing the missing value in Weight_Status based on the MRW

Figure 3.1-24 MRW ranges between the Weight_Status

Figure 3.1-25 Replacing the data based on the range

Similar to Chol_Status, the Weight_Status column is fixed by identifying the MRW ranges
that determine each Weight_Status and filling the missing values based on the MRW.

Figure 3.1-26 New descriptive table of Weight_Status

3.1.13 Data with no missing value

Figure 3.1-27 Data with no missing value

3.2 Noisy Data


To detect noisy data, the first step is to find the outliers in the numerical data by
plotting a box plot for each numerical column. Only outliers with extreme values are
treated; mild outliers are left in place.

3.2.1 Age CHD Diagnosed Without the 0 Value

Figure 3.2-1 Boxplot of AgeCHDdiag

Since a 0 in AgeCHDdiag means the age is unknown, these values are excluded from
the box plot. The outliers are not extreme, so no outlier treatment is needed.

3.2.2 Age At Start

Figure 3.2-2 Boxplot of AgeAtStart

The box plot shows no outliers, so the column is clean.

3.2.3 Height

Figure 3.2-3 Boxplot of Height

Figure 3.2-4 Code to replace the outlier

Since only the extreme values are treated, the limits are computed as follows:

IQR = Q3 - Q1

Upper Limit = Q3 + 3*IQR

Lower Limit = Q1 - 3*IQR

(Outliers, n.d.)

This differs from the common 1.5 * IQR rule because the goal is to fix only the extreme
outliers. Q1 and Q3 are taken from the descriptive statistics table generated in an
earlier section, where they are labelled 25th pctl and 75th pctl respectively.

Any value that falls outside these limits is then replaced with the limit it exceeds,
either the upper or the lower limit value.
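
The capping step can be illustrated with the short Python sketch below. The function name is hypothetical, and the quartiles are computed by Python's statistics module rather than read from the SAS descriptive table, so the limits may differ slightly from those in the figures.

```python
import statistics


def cap_extreme_outliers(values, k=3.0):
    """Clamp values to [Q1 - k*IQR, Q3 + k*IQR]; k = 3 targets extreme outliers only."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, lower, upper
```

With k = 3 only values far from the quartiles are changed, so the mild outliers flagged by the usual 1.5 * IQR whiskers remain visible in the new box plots.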

Figure 3.2-5 New boxplot of Height

3.2.4 Weight

Figure 3.2-6 Boxplot of Weight

Figure 3.2-7 Code to fix the outliers

Figure 3.2-8 New boxplot of weight

3.2.5 Diastolic

Figure 3.2-9 Boxplot of Diastolic

Figure 3.2-10 Code to fix the outliers

Figure 3.2-11 New boxplot of Diastolic

3.2.6 Systolic

Figure 3.2-12 Boxplot of Systolic

Figure 3.2-13 Code to fix the outliers

Figure 3.2-14 New boxplot of Systolic

3.2.7 MRW

Figure 3.2-15 Boxplot of MRW

Figure 3.2-16 Code to fix the outliers

Figure 3.2-17 New boxplot of MRW

3.2.8 Smoking

Figure 3.2-18 Boxplot of Smoking

Figure 3.2-19 Code to fix the outliers

Figure 3.2-20 New boxplot of Smoking

3.2.9 Age At Death Without the 0 Value

Figure 3.2-21 Boxplot of Age At Death

As the 0 value means the subject is alive, these values are excluded from
the box plot. No extreme outlier is detected in the figure above.

3.2.10 Cholesterol

Figure 3.2-22 Boxplot of Cholesterol

Figure 3.2-23 Code to fix the outliers

Figure 3.2-24 New boxplot of Cholesterol

3.2.11 Explanation
Outliers are still visible in all the new box plots. Because the goal of this section is
to fix only the extreme values and keep the data close to the original, some mild outliers
remain. These are not expected to reduce the accuracy of the later analysis.

3.3 Inconsistent Data


Inconsistent data is data whose values do not follow a single format within a
column, or whose data type is wrong.

Figure 3.3-1 No Attribute of the Wrong Data Type

Since no attribute has the wrong data type and, as seen in Section 2.1, there are
no incorrect values in the dataset, no changes are made for these.

The only format inconsistency is in the Height column.

Figure 3.3-2 Height Column

As seen above, the values in the column are not standardized: they have differing numbers
of decimal places. Moreover, among the comparable numerical columns (Weight, MRW, etc.),
only Height contains decimal values. The column is therefore standardized by rounding
Height to the nearest integer.
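
For readers reproducing this step in Python, note that the built-in round() rounds halves to the nearest even integer, which may not match the result shown in the figures. The sketch below assumes that halves are meant to round up; the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to the nearest integer, with halves rounded up (assumed behaviour)."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

For example, a Height of 64.5 becomes 65 here, whereas Python's round(64.5) gives 64.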

Figure 3.3-3 Changing the format of the column

Figure 3.3-4 New format of the Height Column

3.4 Data Transformation


3.4.1 Adding a unique ID for each row
Since the data has no primary key, a unique ID column is generated.
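
One simple way to generate such an ID is to number the rows in order, as in the Python sketch below. The function name and the start value of 1 are assumptions; the approach actually used is the one in Figure 3.4-1.

```python
def add_row_ids(rows, start=1):
    """Return copies of the rows with a sequential 'ID' field added."""
    return [{"ID": i, **row} for i, row in enumerate(rows, start=start)]
```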

Figure 3.4-1 Code to generate a unique ID

Figure 3.4-2 A new 'ID' column created

3.4.2 Adding a new column for the range of Age CHD Diagnosed
For each Age attribute, a new column is created that gives the age range and status, to
assist the analysis. Ages are grouped using the common categories Child,
Young Adults, Middle-aged Adults, and Senior.

Figure 3.4-3 Code to generate a new column for AgeCHDStat

As specified when fixing the missing values, the 0 value is classified as “Unknown”.
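
The grouping can be sketched in Python as below. Only the Senior boundary (above 50) is stated in this report; the cut-offs at 18 and 35 and the function name are assumptions made for illustration, and the real boundaries are those in the code figures.

```python
def age_group(age, zero_label="Unknown"):
    """Label an age; 0 is a placeholder ('Unknown' for AgeCHDdiag, 'Alive' for AgeAtDeath)."""
    if age == 0:
        return zero_label
    if age < 18:        # assumed boundary
        return "Child"
    if age <= 35:       # assumed boundary
        return "Young Adults"
    if age <= 50:
        return "Middle-aged Adults"
    return "Senior"     # above 50
```

Passing zero_label="Alive" gives the labelling for AgeAtDeath, where 0 marks a living subject rather than an unknown age.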

Figure 3.4-4 New column created for AgeCHDStat

3.4.3 Adding a new column for the range of Age At Start

Figure 3.4-5 Code to generate a new column for AgeAtStart

Figure 3.4-6 New column of AgeStartStat

3.4.4 Adding a new column for the range of Age At Death

Figure 3.4-7 Code to generate new column of AgeDeathStat

As specified when fixing the missing values, the 0 value means the person is alive,
hence the label.

Figure 3.4-8 New column of AgeDeathStat

4. Visualization and Analysis of the Processed Data


4.1 Final Data Set
After processing, the final data set is shown in the sample below.

Figure 4.1-1 Sample of final data set

4.2 DeathCause by Status

Figure 4.2-1 Bar Chart of DeathCause based on Status

The bar chart above shows that, apart from the Alive status, coronary heart
disease is the most common cause of death, at 605 cases.

4.3 AgeStartStat

Figure 4.3-1 Bar Chart of AgeStartStat

The figure above shows that most subjects were Middle-aged Adults at the start,
followed by Senior and then Young Adults.

4.4 AgeCHDStat

Figure 4.4-1 Bar Chart of AgeCHDStat

As seen from the figure above, apart from the unknown data, CHD is mostly diagnosed in
the Senior group, which is above the age of 50.


4.5 AgeDeathStat

Figure 4.5-1 Bar Chart of AgeDeathStat

The age at death is also mostly in the Senior group (above 50), with a count of 1928.


4.6 DeathCause group by AgeDeathStat

Figure 4.6-1 Bar Chart of Death Cause group by the Age of Death

After filtering out the subjects who are still alive, we can see that the highest cause of
death for the Senior group is coronary heart disease, followed by cancer. For Middle-aged
Adults, the highest cause of death is other causes, with cancer second.
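
The filter and grouped count behind this kind of chart can be sketched as follows. This is a minimal illustration rather than the method used to produce the figure. The records are hypothetical and are not taken from the data set, and using "Alive" as the AgeDeathStat label for living subjects is an assumption.

```python
from collections import Counter

# Hypothetical records; the values are made up for illustration.
records = [
    {"AgeDeathStat": "Senior", "DeathCause": "Coronary Heart Disease"},
    {"AgeDeathStat": "Senior", "DeathCause": "Cancer"},
    {"AgeDeathStat": "Senior", "DeathCause": "Coronary Heart Disease"},
    {"AgeDeathStat": "Middle-aged Adults", "DeathCause": "Other"},
    {"AgeDeathStat": "Middle-aged Adults", "DeathCause": "Cancer"},
    {"AgeDeathStat": "Alive", "DeathCause": "Alive"},
]

# Step 1: drop the subjects that are still alive (assumed label "Alive").
deceased = [r for r in records if r["AgeDeathStat"] != "Alive"]

# Step 2: count each (age group, death cause) pair.
counts = Counter((r["AgeDeathStat"], r["DeathCause"]) for r in deceased)

# Step 3: list the pairs from most to least frequent.
for (group, cause), n in counts.most_common():
    print(f"{group:<20} {cause:<25} {n}")
```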


4.7 Number of patients with CHD diagnosed group by Status

Figure 4.7-1 Number of patients with CHD diagnosed group by Status

From the figure above, we can conclude that more than half of the people who have been
diagnosed with CHD are dead, mostly in the Senior group with a count of 814.


4.8 Weight and Height Group By Sex

Figure 4.8-1 Scatter plot of Height and Weight Group by Sex

As seen from the scatter plot, the Female data dominates the lower left of the plot, meaning
that females have smaller Height and Weight overall, while the Male data dominates the rest.
Despite that, the largest Weight, at almost 300, occurs at a Height between 60 and 65.


4.9 Weight Status Group by Sex

Figure 4.9-1 Bar Chart of Weight Status Group by Sex

We can conclude that most of the people in the data are overweight, with females having the
highest count at 1909 overweight cases.


4.10 Systolic and Diastolic Group By Blood Pressure Status

Figure 4.10-1 Systolic and Diastolic Group by Blood Pressure Status

From the figure above, we can conclude that the higher the Diastolic and Systolic values, the
higher the blood pressure status. The optimal Diastolic is around 70, with Systolic at
approximately 100 to 110.

4.11 Correlation between Height, Weight and MRW

Figure 4.11-1 Correlation between Height, Weight, and MRW


From the correlation table above, we can conclude that the attribute most strongly correlated
with MRW is Weight at 0.76, while Height shows a weak negative correlation with MRW at
-0.13. Weight and Height also have a moderate correlation of 0.51.
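
A pairwise correlation table of this kind can be sketched as follows. The sketch assumes that the table reports the Pearson correlation coefficient, which the report does not state. The helper name pearson and the sample values are hypothetical, so the printed numbers will not match the figure.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical values; they are not taken from the data set.
columns = {
    "Height": [60.0, 62.5, 65.0, 67.5, 70.0],
    "Weight": [120.0, 150.0, 140.0, 180.0, 170.0],
    "MRW": [110.0, 128.0, 108.0, 130.0, 112.0],
}

for a, b in combinations(columns, 2):
    print(f"{a} vs {b}: {pearson(columns[a], columns[b]):.2f}")
```

A coefficient near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one. It describes association only, so it does not by itself show that one attribute causes the other.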

4.12 Weight and MRW Group by Sex

Figure 4.12-1 Scatter Plot between MRW and Weight Group by Sex

We can see that males have a higher overall Weight throughout the data compared to females.
We can also see that the higher the Weight, the higher the MRW.


4.13 Height and Weight Group by Weight Status

Figure 4.13-1 Scatter plot of Height and Weight Group by Weight Status

From the scatter plot above, we can clearly see the separation between the weight statuses,
with the Overweight data being the most frequent.


4.14 Cholesterol Level Compared to Age of Death

Figure 4.14-1 Heatmap of Cholesterol level compared to Age of Death

The heatmap above shows that the most frequent age of death ranges from 60 to 80, with a
cholesterol level from 200 to just above 250.
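
The counting behind such a heatmap can be sketched as a two-dimensional binning, where each cell holds the number of records that fall into one age range and one cholesterol range. The bin widths (10 years and 50 cholesterol units), the helper name bin_start, and the sample pairs are assumptions made for illustration and are not taken from the data set or the figure.

```python
from collections import Counter


def bin_start(value, width):
    """Return the lower edge of the bin of the given width that contains value."""
    return int(value // width) * width


# Hypothetical (AgeAtDeath, Cholesterol) pairs for deceased subjects.
pairs = [(62, 210), (68, 245), (71, 230), (45, 190), (79, 255), (66, 205)]

# Each key is (age bin start, cholesterol bin start); each value is a count.
cells = Counter((bin_start(age, 10), bin_start(chol, 50)) for age, chol in pairs)

for (age_bin, chol_bin), n in sorted(cells.items()):
    print(f"age {age_bin}-{age_bin + 9}, cholesterol {chol_bin}-{chol_bin + 49}: {n}")
```

The cell with the largest count corresponds to the darkest region of the heatmap.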


4.15 Smoking Level and Cholesterol Compared to Age of Death

Figure 4.15-1 Heatmap of Smoking level and Cholesterol compared to Age of Death

The heatmap above shows the age, smoking, and cholesterol ranges. As seen above, the
higher the smoking level, the higher the Cholesterol, and when both attributes are high,
the age of death can be lower than the average.

4.16 Relationship between blood pressure, Cholesterol, and MRW group by Sex


Figure 4.16-1 Relationship between blood pressure, Cholesterol, and MRW group by Sex

We can conclude from the figure above that the higher the MRW (a higher relative weight,
which can lead to overweight status) and the Cholesterol, the higher the blood pressure. The
graph shows slightly higher Cholesterol and MRW in the female data.

5. Conclusion
The paper has shown the data set before and after pre-processing. Changes were made to
prepare the data for a more accurate analysis by replacing the missing values and the
outliers found throughout the numerical attributes. The more accurate analysis is also
shown by plotting graphs based on the new data set. Overall, the paper has fulfilled all
the requirements of the assignment.


6. References
Bhandari, P. (2022, September 26). How to Calculate the Standard Deviation | Formula,
Meaning & Examples. Scribbr. https://www.scribbr.co.uk/stats/standard-deviation-meaning/
Outliers. (n.d.). https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/
