
BI12-215 Đàm Hữu Khoa

BI12-206 Lê Quang Khánh


REPORT LABWORK 1
I, Iris Dataset
1, Study the dataset
- This dataset comprises instances of iris plants, with each instance representing a specific plant.
Features table:

Variable Name   Role      Type
sepal length    Feature   Continuous
sepal width     Feature   Continuous
petal length    Feature   Continuous
petal width     Feature   Continuous
1. Sepal Length (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Sepal length is a continuous variable that can take any numerical value within a range. It is a quantitative measure representing the length of the sepals in centimeters.
2. Sepal Width (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Sepal width is a continuous variable, representing the width of the sepals in centimeters. Like sepal length, it is a quantitative measure.
3. Petal Length (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Petal length is a continuous variable, indicating the length of the petals in centimeters. It is a quantitative measure.
4. Petal Width (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Petal width is a continuous variable, representing the width of the petals in centimeters. It is a quantitative measure.
Mean of features:
sepal length 5.843333
sepal width 3.054
petal length 3.758667
petal width 1.198667
Variance of features:
Sepal length 0.685694
Sepal width 0.188004
Petal length 3.113179
Petal width 0.582414
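The variances above are consistent with the sample variance (division by n - 1), which is the default in pandas; this is an assumption about how they were computed, since the report does not show that step. The sketch below uses a small hypothetical array (not Iris data) to show how the two common definitions differ.

```python
import numpy as np

# Hypothetical measurements in cm (illustrative only, not taken from the Iris data).
values = np.array([5.1, 4.9, 6.3, 5.8, 7.1])

mean = values.mean()
# ddof=1 divides by n - 1 (sample variance, the pandas default).
sample_var = values.var(ddof=1)
# ddof=0 divides by n (population variance, the NumPy default).
population_var = values.var(ddof=0)

print(f"mean={mean:.4f}")
print(f"sample variance={sample_var:.4f}")
print(f"population variance={population_var:.4f}")
```

When comparing results produced by different libraries, it is worth checking which of the two definitions each one uses by default.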

Covariance matrix of features:

               sepal length   sepal width   petal length   petal width
sepal length       0.685694     -0.039268       1.273682      0.516904
sepal width       -0.039268      0.188004      -0.321713     -0.117981
petal length       1.273682     -0.321713       3.113179      1.296387
petal width        0.516904     -0.117981       1.296387      0.582414

Correlation matrix of features:

               sepal length   sepal width   petal length   petal width
sepal length       1            -0.109369       0.871754      0.817954
sepal width       -0.109369      1             -0.420516     -0.356544
petal length       0.871754     -0.420516       1             0.962757
petal width        0.817954     -0.356544       0.962757      1

- The highest correlation coefficient, approximately 0.962757, is between 'petal length' and 'petal width', so this is the most correlated pair of features in the Iris dataset.
- This indicates a strong positive linear relationship: as the length of the petal increases, the width also tends to increase, and vice versa. The two features therefore carry largely similar information about the dataset.
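As a cross-check, each correlation coefficient can be recovered from the covariance matrix by dividing a covariance by the product of the two standard deviations. The sketch below applies this to the Iris covariance values reported above and then picks the off-diagonal entry with the largest absolute value; the variable names are illustrative.

```python
import numpy as np

names = ["sepal length", "sepal width", "petal length", "petal width"]

# Covariance matrix copied from the report.
cov = np.array([
    [ 0.685694, -0.039268,  1.273682,  0.516904],
    [-0.039268,  0.188004, -0.321713, -0.117981],
    [ 1.273682, -0.321713,  3.113179,  1.296387],
    [ 0.516904, -0.117981,  1.296387,  0.582414],
])

# corr[i, j] = cov[i, j] / (std[i] * std[j])
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

# Zero the diagonal (always 1) and find the largest |r| among the rest.
off_diag = np.abs(corr)
np.fill_diagonal(off_diag, 0.0)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)

print(np.round(corr, 6))
print("most correlated pair:", names[i], "/", names[j])
```

Using the absolute value matters here: a strongly negative coefficient would count as a strong relationship too.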
2, PCA
- Apply PCA:

# Applying PCA
from sklearn.decomposition import PCA

# Create a PCA object that keeps two components
pca = PCA(n_components=2)
# Fit PCA on the features
pca.fit(features)
# Project the features onto the two principal components
# (note: this overwrites the original four-column `features` array)
features = pca.transform(features)
# Check the shape of the transformed array
features.shape
(150, 2)

- Explained Variance:
+ Explained Variance Ratio for the first component: 0.7596003257714182
+ Explained Variance Ratio for the second component: 0.2403996742285818
+ Cumulative Sum of Explained Variance Ratio for the first two components: 1.0

- Of the variance retained by these two components, the first principal component (PC1) accounts for approximately 75.96% and the second (PC2) for approximately 24.04%.
- The cumulative sum is reported as 100%. The original data has four features, so two components cannot in general capture all of its variance. A cumulative sum of exactly 1.0 suggests that the ratios were normalized over the two retained components (for example, by fitting PCA to data that had already been projected to two dimensions) and not over the total variance of the original dataset. These figures should therefore not be read as the share of the original variability that is preserved.
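The sketch below illustrates how the explained variance ratio depends on its denominator. It assumes the features were standardized before PCA (the code above does not show a scaling step, so this is an assumption); in that case the component variances are proportional to the eigenvalues of the correlation matrix reported earlier. The helper `ratio_retained` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Correlation matrix copied from the report (rows and columns in the order
# sepal length, sepal width, petal length, petal width).
corr = np.array([
    [ 1.0,      -0.109369,  0.871754,  0.817954],
    [-0.109369,  1.0,      -0.420516, -0.356544],
    [ 0.871754, -0.420516,  1.0,       0.962757],
    [ 0.817954, -0.356544,  0.962757,  1.0     ],
])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order,
# so reverse them to get the largest component first.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Ratios relative to the total variance of all four features.
ratio_total = eigenvalues / eigenvalues.sum()

def ratio_retained(k):
    """Ratios renormalized over the first k components only (hypothetical helper)."""
    kept = eigenvalues[:k]
    return kept / kept.sum()

print("relative to total variance:", np.round(ratio_total, 4))
print("relative to 2 retained:    ", np.round(ratio_retained(2), 4))
print("relative to 3 retained:    ", np.round(ratio_retained(3), 4))
```

Comparing the three printed rows against the reported ratios is one way to check which definition a given set of figures follows. Ratios relative to the total variance sum to 1 only when all four components are included.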
- Visualize data distribution:
- The PCA plot of the Iris dataset shows how well the classes separate along the first two principal components. Iris Setosa forms a distinct cluster that is well separated from the other two classes (Iris Versicolor and Iris Virginica), which suggests that its features differ clearly from theirs when projected onto the two principal components.
- The data points for Iris Versicolor and Iris Virginica overlap to some extent, indicating that these two classes are not entirely separable based on the first two principal components alone. There is a degree of separation, but it is not as pronounced as for Iris Setosa.
- In summary, the classes are separated to varying degrees in the two-dimensional subspace defined by the first two principal components: one class is clearly isolated, and the other two are only partly distinguishable.

Increase number of components (n_components=3):


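- Explained variance:
+ Explained Variance Ratio for the first component: 0.7314730163810985
+ Explained Variance Ratio for the second component: 0.2314978928773266
+ Cumulative Sum of Explained Variance Ratio for the first two components: 0.9629709092584251

- The cumulative explained variance ratio for the first two components is already high (96.30%), so the third component adds only about 3.70%.
- The ratios for the first two components differ from the two-component run (0.7315 and 0.2315 here versus 0.7596 and 0.2404). Ratios measured against the total variance of the original data would not change with the number of components kept, so these values appear to be normalized over the retained components.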
- Visualize data distribution:
II, Heart disease
1, Study the Dataset
- The target field of this dataset indicates the presence of heart disease in the patient.
Features table:

Variable Name   Role      Type
age             Feature   Continuous
sex             Feature   Discrete
cp              Feature   Discrete
trestbps        Feature   Continuous
chol            Feature   Continuous
fbs             Feature   Discrete
restecg         Feature   Discrete
thalach         Feature   Continuous
exang           Feature   Discrete
oldpeak         Feature   Continuous
slope           Feature   Discrete
ca              Feature   Continuous
thal            Feature   Discrete
1. Age (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Age is a continuous variable representing the age of individuals. It is a quantitative measure in years.
2. Sex (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Sex is a categorical variable indicating the gender of individuals. It takes on discrete values, typically 'male' or 'female'. It is a qualitative measure.
3. Chest Pain Type (cp) (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Chest Pain Type is a categorical variable representing the type of chest pain. It takes on discrete values, and the specific categories are not provided. It is a qualitative measure.
4. Resting Blood Pressure (trestbps) (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Resting Blood Pressure is a continuous variable representing the blood pressure of individuals at rest. It is a quantitative measure in mm Hg.
5. Serum Cholesterol (chol) (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Serum Cholesterol is a continuous variable representing the level of serum cholesterol in mg/dl. It is a quantitative measure.
6. Fasting Blood Sugar (fbs) (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Fasting Blood Sugar is a categorical variable indicating whether an individual has fasting blood sugar greater than 120 mg/dl. It takes on discrete values, possibly 'yes' or 'no'. It is a qualitative measure.
7. Resting Electrocardiographic Results (restecg) (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Resting Electrocardiographic Results is a categorical variable representing the results of the resting electrocardiogram. Specific categories are not provided. It is a qualitative measure.
8. Maximum Heart Rate Achieved (thalach) (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: Maximum Heart Rate Achieved is a continuous variable representing the highest heart rate achieved. It is a quantitative measure.
9. Exercise Induced Angina (exang) (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Exercise Induced Angina is a categorical variable indicating whether angina is induced by exercise. It takes on discrete values, possibly 'yes' or 'no'. It is a qualitative measure.
10. ST Depression Induced by Exercise (oldpeak) (Feature):
  - Type: Continuous
  - Nature: Quantitative
  - Explanation: ST Depression Induced by Exercise is a continuous variable representing the ST depression induced by exercise relative to rest. It is a quantitative measure.
11. Slope (slope) (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Slope is a categorical variable representing the slope of the peak exercise ST segment. Specific categories are not provided. It is a qualitative measure.
12. Number of Major Vessels Colored by Fluoroscopy (ca) (Feature):
  - Type: Integer
  - Nature: Quantitative
  - Explanation: Number of Major Vessels Colored by Fluoroscopy is a quantitative variable representing the count of major vessels (0-3) colored by fluoroscopy.
13. Thal (thal) (Feature):
  - Type: Categorical
  - Nature: Qualitative
  - Explanation: Thal is a categorical variable, and specific categories are not provided. It is a qualitative measure.
Mean of features:
age 54.438944
sex 0.679868
cp 3.158416
trestbps 131.689769
chol 246.693069
fbs 0.148515
restecg 0.990099
thalach 149.607261
exang 0.326733
oldpeak 1.039604
slope 1.60066
ca 0.672241
thal 4.734219

Variance of features:

age 81.697419
sex 0.218368
cp 0.921841
trestbps 309.75112
chol 2680.84919
fbs 0.126877
restecg 0.989968
thalach 523.265775
exang 0.220707
oldpeak 1.348095
slope 0.379735
ca 0.878791
thal 3.762458
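If the data come from the UCI Cleveland file, the `ca` and `thal` columns contain a few missing values, and summary statistics are then usually computed over the available entries only. Whether the figures above were obtained this way is an assumption. The sketch below uses a hypothetical column to show how missing entries (encoded as NaN) affect the mean and sample variance.

```python
import numpy as np

# Hypothetical column with two missing entries encoded as NaN (not real data).
ca = np.array([0.0, 3.0, np.nan, 1.0, 0.0, np.nan, 2.0])

# A plain mean propagates NaN, so the result is not usable.
plain_mean = np.mean(ca)

# nanmean / nanvar skip the missing entries.
n_valid = int(np.sum(~np.isnan(ca)))
mean = np.nanmean(ca)
var = np.nanvar(ca, ddof=1)  # sample variance over the valid entries

print("plain mean:", plain_mean)
print("valid entries:", n_valid)
print(f"mean={mean:.4f} variance={var:.4f}")
```

Rows with missing values also need to be dropped or imputed before PCA, because the decomposition cannot be computed on NaN entries.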

Covariance matrix of features:

age sex cp trestbps chol fbs


age 81.697419 -0.411995 0.903744 45.328678 97.78749 0.381614
sex -0.411995 0.218368 0.004524 -0.530107 -4.83699 0.007967
cp 0.903744 0.004524 0.921841 -0.609632 3.595141 -0.013671
trestbps 45.328678 -0.530107 -0.609632 309.75112 118.5733 1.099207
chol 97.787489 -4.836994 3.595141 118.573339 2680.849 0.181496
fbs 0.381614 0.007967 -0.013671 1.099207 0.181496 0.126877
restecg 1.338797 0.010065 0.064488 2.566455 8.811521 0.024654
thalach -81.423065 -0.520184 -7.344863 -18.258005 -4.06465 -0.063996
exang 0.38922 0.032096 0.173235 0.535473 1.491345 0.004295
oldpeak 2.13885 0.055436 0.225493 3.865638 2.799282 0.002377
slope 0.901034 0.010808 0.089961 1.273053 -0.1296 0.013147
ca 3.066396 0.040964 0.210512 1.639436 5.791385 0.048394
thal 2.240487 0.34495 0.494972 4.57381 1.429834 0.048981

restecg thalach exang oldpeak slope ca


age 1.338797 -81.423065 0.38922 2.13885 0.901034 3.066396
sex 0.010065 -0.520184 0.032096 0.055436 0.010808 0.040964
cp 0.064488 -7.344863 0.173235 0.225493 0.089961 0.210512
trestbps 2.566455 -18.258005 0.535473 3.865638 1.273053 1.639436
chol 8.811521 -4.064651 1.491345 2.799282 -0.1296 5.791385
fbs 0.024654 -0.063996 0.004295 0.002377 0.013147 0.048394
restecg 0.989968 -1.897941 0.03967 0.13185 0.082126 0.119706
thalach -1.897941 523.265775 -4.063307 -9.112209 -5.4355 -5.68627
exang 0.03967 -4.063307 0.220707 0.157216 0.074618 0.064162
oldpeak 0.13185 -9.112209 0.157216 1.348095 0.413219 0.322753
slope 0.082126 -5.435501 0.074618 0.413219 0.379735 0.063747
ca 0.119706 -5.68627 0.064162 0.322753 0.063747 0.878791
thal 0.047342 -12.399734 0.300155 0.769517 0.343688 0.466694

thal
age 2.240487
sex 0.34495
cp 0.494972
trestbps 4.57381
chol 1.429834
fbs 0.048981
restecg 0.047342
thalach -12.399734
exang 0.300155
oldpeak 0.769517
slope 0.343688
ca 0.466694
thal 3.762458

Correlation matrix of features:

age sex cp trestbps chol fbs


age 1 -0.097542 0.104139 0.284946 0.20895 0.11853
sex -0.097542 1 0.010084 -0.064456 -0.19992 0.047862
cp 0.104139 0.010084 1 -0.036077 0.072319 -0.039975
trestbps 0.284946 -0.064456 -0.036077 1 0.13012 0.17534
chol 0.20895 -0.199915 0.072319 0.13012 1 0.009841
fbs 0.11853 0.047862 -0.039975 0.17534 0.009841 1
restecg 0.148868 0.021647 0.067505 0.14656 0.171043 0.069564
thalach -0.393806 -0.048663 -0.334422 -0.045351 -0.00343 -0.007854
exang 0.091661 0.146201 0.38406 0.064762 0.06131 0.025665
oldpeak 0.203805 0.102173 0.202277 0.189171 0.046564 0.005747
slope 0.16177 0.037533 0.15205 0.117382 -0.00406 0.059894
ca 0.362605 0.093185 0.233214 0.098773 0.119 0.145478
thal 0.127389 0.380936 0.265246 0.133554 0.014214 0.071358

restecg thalach exang oldpeak slope ca thal


age 0.148868 -0.393806 0.091661 0.203805 0.16177 0.362605 0.127389
sex 0.021647 -0.048663 0.146201 0.102173 0.037533 0.093185 0.380936
cp 0.067505 -0.334422 0.38406 0.202277 0.15205 0.233214 0.265246
trestbps 0.14656 -0.045351 0.064762 0.189171 0.117382 0.098773 0.133554
chol 0.171043 -0.003432 0.06131 0.046564 -0.00406 0.119 0.014214
fbs 0.069564 -0.007854 0.025665 0.005747 0.059894 0.145478 0.071358
restecg 1 -0.083389 0.084867 0.114133 0.133946 0.128343 0.024531
thalach -0.083389 1 -0.378103 -0.343085 -0.3856 -0.264246 -0.27963
exang 0.084867 -0.378103 1 0.288223 0.257748 0.14557 0.32968
oldpeak 0.114133 -0.343085 0.288223 1 0.577537 0.295832 0.341004
slope 0.133946 -0.385601 0.257748 0.577537 1 0.110119 0.287232
ca 0.128343 -0.264246 0.14557 0.295832 0.110119 1 0.256382
thal 0.024531 -0.279631 0.32968 0.341004 0.287232 0.256382 1

- The most correlated pair of features is 'oldpeak' (ST depression induced by exercise) and 'slope', with a correlation coefficient of approximately 0.577537, the largest off-diagonal value in absolute terms.
- The positive correlation means that larger values of exercise-induced ST depression tend to occur together with higher 'slope' codes.
- The strongest negative correlation is between 'age' and 'thalach' (maximum heart rate achieved), at approximately -0.393806: older patients tend to reach a lower maximum heart rate. The pair 'thalach' and 'exang' (exercise-induced angina) is also inversely related (-0.378103), so patients who reach a lower maximum heart rate are more likely to experience exercise-induced angina, but this relationship is weaker than several other pairs in the matrix.
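With thirteen features the correlation matrix has 78 distinct pairs, so ranking them by absolute value is more reliable than scanning by eye. The sketch below does this for a selection of entries copied from the correlation matrix above; the dictionary layout is only one possible representation.

```python
# Selected off-diagonal entries copied from the correlation matrix above.
pairs = {
    ("oldpeak", "slope"): 0.577537,
    ("age", "thalach"): -0.393806,
    ("thalach", "slope"): -0.385601,
    ("cp", "exang"): 0.384060,
    ("sex", "thal"): 0.380936,
    ("thalach", "exang"): -0.378103,
    ("age", "ca"): 0.362605,
    ("thalach", "oldpeak"): -0.343085,
    ("oldpeak", "thal"): 0.341004,
}

# Sort by the magnitude of the coefficient, strongest first.
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

for (first, second), r in ranked:
    print(f"{first:>8} / {second:<8} r = {r:+.6f}")
```

Sorting on the absolute value keeps strong negative relationships, such as 'age' and 'thalach', in the ranking alongside the positive ones.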
2, PCA

# Imports needed by the steps below
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(X)

# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(scaled_features)

# Check the shape of the transformed features
print("Shape of features after PCA:", X_pca.shape)
Shape of features after PCA: (303, 2)
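Standardization matters here because the feature variances reported above span several orders of magnitude (from about 0.13 for 'fbs' to about 2681 for 'chol'), so an unscaled PCA would be dominated by the large-variance columns. The sketch below uses hypothetical synthetic data, not the heart disease file, to show that after standardization the covariance matrix equals the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two related columns on very different scales.
age = rng.normal(54, 9, size=200)
chol = 4.0 * age + rng.normal(30, 50, size=200)
X = np.column_stack([age, chol])

# Same operation as StandardScaler: subtract the mean and divide by the
# population standard deviation (ddof=0) of each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov_raw = np.cov(X, rowvar=False)          # dominated by the second column
cov_std = np.cov(Z, rowvar=False, ddof=0)  # covariance of standardized data
corr = np.corrcoef(X, rowvar=False)        # correlation of the raw data

print("raw variances:", np.round(np.diag(cov_raw), 2))
print("standardized covariance equals correlation:", np.allclose(cov_std, corr))
```

In other words, PCA on standardized features is PCA on the correlation matrix, which gives every feature the same weight regardless of its units.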
- Explained variance:
+ Explained Variance Ratio for the first component: 0.23668956528475113
+ Explained Variance Ratio for the second component: 0.12299426418404227
+ Cumulative Sum of Explained Variance Ratio for the first two components: 0.3596838294687934

- PC1 (23.67%): This is the direction along which the standardized data varies the most, but it accounts for less than a quarter of the total variance.
- PC2 (12.30%): PC2 contributes less than PC1 and captures the largest share of the remaining variance in a direction orthogonal to PC1.
- Cumulative sum (35.97%): Together, PC1 and PC2 explain about 35.97% of the total variance.
- This relatively low cumulative sum means that the first two principal components leave roughly 64% of the variance unexplained. The variance is spread across many directions, so the dataset is not well represented by a small number of principal components.
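A common way to decide how many components to keep is to take the cumulative sum of the explained variance ratios and find the first point where it reaches a chosen threshold. The sketch below applies this to the two ratios reported above; `components_needed` is a hypothetical helper, and the thresholds are arbitrary examples.

```python
import numpy as np

def components_needed(ratios, threshold):
    """Return the smallest k whose cumulative ratio reaches threshold, or None."""
    cumulative = np.cumsum(ratios)
    reached = np.nonzero(cumulative >= threshold)[0]
    return int(reached[0]) + 1 if reached.size else None

# Explained variance ratios reported above for the first two components.
ratios = [0.23668956528475113, 0.12299426418404227]

print("cumulative:", np.cumsum(ratios))
print("components for 30%:", components_needed(ratios, 0.30))
print("components for 90%:", components_needed(ratios, 0.90))
```

A result of `None` means the threshold is not reached with the ratios supplied, so PCA would have to be fitted with more components before the question can be answered.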
- Visualize data distribution:
- The plot of the dataset on the first two principal components suggests that the classes are not well separated in this reduced two-dimensional subspace. The data points for the classes labeled 0, 1, 2, 3, and 4 are mixed together without clear boundaries between them.
- The lack of distinct separation indicates that the information captured by the first two principal components is not sufficient to discriminate clearly between the classes. This agrees with the explained variance ratios, where the first two components explain only a modest portion of the total variance in the data.

Increase number of components (n_components=3):


- Explained variance:
+ Explained Variance Ratio for the first component: 0.23668956528475113
+ Explained Variance Ratio for the second component: 0.12299426418404227
+ Cumulative Sum of Explained Variance Ratio for the first two components: 0.3596838294687934
- With three principal components, the explained variance ratios for the first and second components are unchanged (approximately 0.237 and 0.123), so their cumulative sum remains around 0.360. The ratio for the third component is not listed above.
- The third component adds further explained variance, but because components are ordered by decreasing variance, its ratio cannot exceed that of the second component (0.123). Three components therefore explain less than half of the total variance, and a substantial portion of the dataset's variance is still not captured.
- Visualize data distribution:
