LAB1
- The highest correlation coefficient, approximately 0.962757, is between 'petal length' and 'petal width', making them the most correlated pair of features in the Iris dataset.
- This high correlation indicates a strong positive linear relationship between the two features: as petal length increases, petal width also tends to increase, and vice versa. The two features are therefore closely related and likely carry similar information about the dataset.
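The pair search described above can be sketched as follows; this is an illustrative recomputation (the masking approach is my own, not necessarily the one used in the lab):

```python
# Sketch: find the most correlated feature pair in the Iris dataset.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
corr = iris.data.corr()

# Mask the diagonal (self-correlations of 1.0), then take the pair
# with the largest remaining absolute coefficient.
pair = corr.abs().where(lambda c: c < 1.0).stack().idxmax()
print(pair, round(corr.loc[pair], 6))
```

This should confirm that 'petal length (cm)' and 'petal width (cm)' are the top pair.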
2, PCA
- Apply PCA:
# applying PCA
from sklearn.decomposition import PCA
# create a PCA object that keeps the first two components
pca = PCA(n_components=2)
# fit PCA on the feature matrix
pca.fit(features)
# project the features onto the two principal components
features = pca.transform(features)
# check the shape of the transformed array
features.shape
(150, 2)
- Explained Variance:
+ Explained Variance Ratio for the first component: 0.7596003257714182
+ Explained Variance Ratio for the second component: 0.2403996742285818
+ Cumulative Sum of Explained Variance Ratio for the first two components: 1.0
- The first principal component (PC1) explains approximately 75.96% of the total variance in the data,
representing a significant portion of the dataset's information.
- The second principal component (PC2) contributes approximately 24.04% of the total variance.
- The cumulative sum of the explained variance ratio for the first two components is reported as 1.0. Two components of a four-feature dataset cannot capture all of the original variability, so a sum of exactly 1.0 suggests the ratios were normalized over the two retained components rather than over the total variance of all four features; relative to the full dataset, the first two components capture most, but not all, of the variance.
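The ratios can be cross-checked by fitting a full-rank PCA, so that each component's share is reported relative to the total variance of all four features. A minimal sketch, assuming standardized features (the report does not state which preprocessing was used):

```python
# Sketch: full-rank PCA on standardized Iris features, so the
# ratios are relative to the total variance of all four features.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
ratios = PCA().fit(X).explained_variance_ratio_  # keep all components
print(ratios)             # one ratio per component
print(np.cumsum(ratios))  # running total of variance explained
```

With all components retained, the ratios sum to exactly 1.0, which makes it easy to see how much the first two components actually cover.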
- Visualize data distribution:
- The graph obtained through the PCA visualization of the Iris dataset demonstrates the separation of classes
using the first two principal components. In this case, Iris Setosa appears to be well-separated from the other
two classes (Iris Versicolour and Iris Virginica). The distinct cluster of data points corresponding to Iris
Setosa suggests that its features are sufficiently different from the other classes when projected onto the two
principal components.
- On the other hand, the data points for Iris Versicolour and Iris Virginica overlap to some extent, indicating
that their feature distributions are not entirely separable based on these two principal components alone.
While there is a degree of separation, it's not as pronounced as with Iris Setosa.
- In summary, the individual classes are separated to varying degrees in the two-dimensional subspace
defined by the first two principal components. The graph provides a visual representation of the relative
distinctness of each class in this reduced feature space.
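The scatter plot discussed above can be reproduced with a short script; a minimal sketch (the exact plotting style used in the lab may differ):

```python
# Sketch: project Iris onto the first two principal components
# and color the points by class to visualize class separation.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_pca = PCA(n_components=2).fit_transform(iris.data)

for label in range(3):
    mask = iris.target == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1],
                label=iris.target_names[label])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.savefig("iris_pca.png")  # or plt.show() in an interactive session
```

Setosa forms its own cluster, while Versicolour and Virginica partially overlap, matching the observations above.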
- Since the cumulative explained variance ratio for the first two components was already quite high (96.30%), the additional variance explained by the third component was relatively small (about 3.70%).
- Visualize data distribution:
II, Heart disease
1, Study the Dataset
- This dataset's target field refers to the presence of heart disease in the patient
Features table
Variance of features:
age 81.697419
sex 0.218368
cp 0.921841
trestbps 309.75112
chol 2680.84919
fbs 0.126877
restecg 0.989968
thalach 523.265775
exang 0.220707
oldpeak 1.348095
slope 0.379735
ca 0.878791
thal 3.762458
Covariance of each feature with 'thal' (the 'thal' column of the covariance matrix; note that its own entry matches the variance of 'thal' above):
age 2.240487
sex 0.34495
cp 0.494972
trestbps 4.57381
chol 1.429834
fbs 0.048981
restecg 0.047342
thalach -12.399734
exang 0.300155
oldpeak 0.769517
slope 0.343688
ca 0.466694
thal 3.762458
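With the features in a pandas DataFrame, both tables above come directly from `.var()` and `.cov()`. A small random stand-in frame with a few of the column names is used here for illustration; in the lab, the frame holds the real 13 heart-disease features:

```python
# Sketch: per-feature variance and the covariance column for 'thal'.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["age", "thalach", "thal"],  # stand-in for the 13 features
)

print(df.var())          # variance of each feature
print(df.cov()["thal"])  # covariance of every feature with 'thal'
```

The diagonal entry `cov(thal, thal)` equals `var(thal)`, which is why 3.762458 appears in both tables.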
- The most correlated pair of features is 'thalach' (maximum heart rate achieved) and 'exang' (exercise-
induced angina), with a correlation coefficient of approximately -0.378103.
The negative correlation between 'thalach' and 'exang' suggests an inverse relationship. In practical terms, it
implies that as the maximum heart rate achieved during exercise decreases, the likelihood of experiencing
exercise-induced angina increases, and vice versa. This information can be valuable in understanding the
relationship between these two features and their potential impact on heart health.
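Finding the top pair by absolute correlation, while keeping the sign for interpretation, can be sketched like this. Random stand-in data with a few of the heart-disease column names is used; in the lab this runs on the real frame:

```python
# Sketch: locate the most strongly correlated feature pair by
# absolute value, keeping the signed coefficient for reporting.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["thalach", "exang", "oldpeak", "slope"])

corr = df.corr()
abs_corr = corr.abs().where(~np.eye(len(corr), dtype=bool))  # mask diagonal
row, col = abs_corr.stack().idxmax()
print(row, col, corr.loc[row, col])  # signed coefficient of the top pair
```

Taking absolute values first matters here, since the top pair in the heart data is negatively correlated.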
2, PCA
# Applying PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(scaled_features)  # scaled_features: the standardized feature matrix
- PC1 (23.67%): This component captures a significant portion of the overall variability in the data. It
represents the direction in which the data varies the most.
- PC2 (12.30%): While PC2 contributes less to the total variance compared to PC1, it still captures
additional patterns or directions of variability orthogonal to PC1.
- Cumulative Sum (35.97%): The cumulative sum indicates the proportion of total variance explained by
the combined information from both PC1 and PC2. In this case, these two components together explain
about 35.97% of the total variance.
- The relatively low cumulative sum suggests that the first two principal components do not capture a large
portion of the total variance in the data. This might imply that the dataset has complex patterns that are not
well-represented by a small number of principal components.
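One way to quantify this is to fit a full PCA and scan the running total of explained variance for the number of components needed to reach a threshold. A sketch with a random stand-in matrix of the same shape as the heart data (303 rows, 13 features):

```python
# Sketch: how many components does the data actually need?
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(303, 13)))

cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.90)) + 1  # components for >= 90% variance
print(cumulative)
print(f"{k} components reach 90% of the variance")
```

A low cumulative sum at two components, as observed here, means `k` would be much larger than 2 for this dataset.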
- Visualize data distribution:
- The visual representation of the dataset using the first two principal components suggests that the classes
are not well-separated in this reduced two-dimensional subspace. It appears that the data points
corresponding to the classes labeled 0, 1, 2, 3, and 4 are mixed together without clear boundaries between
them.
- The lack of distinct separation between classes indicates that the information captured by the first two
principal components may not be sufficient to clearly discriminate between the different classes in the
original feature space. This aligns with the observation from the explained variance ratios, where the first
two components explained only a modest portion of the total variance in the data.
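The plot discussed above can be produced along the following lines; synthetic stand-in features and labels are used here, whereas the lab plots the scaled heart features and the 0-4 target column:

```python
# Sketch: scatter of the first two principal components,
# colored by the 0-4 disease label.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(303, 13))       # stand-in for scaled features
y = rng.integers(0, 5, size=303)     # stand-in for the target column

X_pca = PCA(n_components=2).fit_transform(X)
sc = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis", s=15)
plt.colorbar(sc, label="disease class (0-4)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("heart_pca.png")  # or plt.show() in an interactive session
```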