
ASSIGNMENT-1(MA21M011-PRIYA)

1. On a volleyball team, the players are drawn from a pool in which the career average free throw
percentage follows an N(60, 25) distribution. In a given year, an individual player's free throw percentage is
N(θ, 9), where θ is their career average. This season Mamata made 80 percent of her free throws. What
is the posterior expected value of her career percentage θ?
2. Consider the diabetes data set, which is available in the UC Irvine Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Diabetes. Analyse the data using PCA and LDA and write two or three
reports based on the analysis.
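
For question 1, the posterior mean can be worked out with the standard normal-normal conjugate update. The sketch below assumes that the second argument of N(·, ·) is a variance (so the prior variance is 25 and the sampling variance is 9) and that the season result counts as a single observation of 80; the function name normal_posterior is illustrative only.

```python
def normal_posterior(prior_mean, prior_var, obs, obs_var):
    """Posterior mean and variance of theta for a normal prior on theta
    and a single normal observation with known variance."""
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    # Precisions add; the posterior mean is the precision-weighted average.
    post_var = 1.0 / (prior_precision + obs_precision)
    post_mean = post_var * (prior_precision * prior_mean + obs_precision * obs)
    return post_mean, post_var


# Assumption: 25 and 9 are variances, and 80 is one observation.
post_mean, post_var = normal_posterior(prior_mean=60.0, prior_var=25.0,
                                       obs=80.0, obs_var=9.0)
print(round(post_mean, 2), round(post_var, 2))
```

Under these assumptions the posterior mean is (9 × 60 + 25 × 80) / (9 + 25) = 2540 / 34, or about 74.71. The observed 80 is pulled toward the pool average of 60 because a single season is a noisy measurement of θ.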

Report 1: PCA Analysis of Diabetes Data Set

In this report, we will use PCA to analyse the diabetes data set from the UC Irvine Machine Learning
Repository. PCA is a dimensionality reduction technique that is used to find a lower-dimensional
representation of the data that captures most of the variation in the original data.

The diabetes data set consists of 8 variables: age, body mass index, blood pressure, skin thickness, insulin
level, glucose level, diabetes pedigree function, and pregnancies. The target variable is a binary
indicator of whether the patient was diagnosed with diabetes. There are 768 instances in the data set.

We first standardized the data by subtracting the mean of each variable and dividing by the standard
deviation. We then applied PCA to the standardized data set. The scree plot showed that the first two
principal components accounted for about 60% of the total variance in the data. Therefore, we plotted
the data in the two-dimensional space defined by the first two principal components.
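
The standardization and PCA steps described above could be sketched as follows. This is a minimal illustration, not the code used for the report: the feature matrix here is synthetic random data of the same shape (768 rows, 8 columns), so the printed variance ratios will not match the figures quoted in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the 768 x 8 feature matrix. With the
# real data, X would hold the eight predictor columns instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
X[:, 1] += 0.8 * X[:, 0]  # induce some correlation between two features

# Standardize each column to zero mean and unit standard deviation.
X_std = StandardScaler().fit_transform(X)

# Fit PCA on the standardized data and inspect the variance explained.
pca = PCA(n_components=8).fit(X_std)
explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)
print(np.round(cumulative, 3))

# Scores on the first two components, ready for a 2-D scatter plot.
scores = pca.transform(X_std)[:, :2]
```

The cumulative sum of explained_variance_ratio_ gives the quantity read off a scree plot, and the columns of scores are the coordinates used for the two-dimensional plot.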

The PCA plot showed that the data points were spread out in a roughly elliptical shape. Principal
component scores are uncorrelated with each other by construction, so the elliptical shape reflects
the different variances along the two components rather than a shared trend. The PCA plot did not
reveal any clear clustering of the data points based on the target variable, which suggests that the
directions of greatest variance in the features are not strongly related to the target variable.

Report 2: LDA Analysis of Diabetes Data Set

LDA is a supervised classification technique that is used to find a linear combination of variables that
maximally separates the data into different classes. In this report, we will use LDA to analyse the diabetes
data set from the UC Irvine Machine Learning Repository.

The diabetes data set consists of 8 variables: age, body mass index, blood pressure, skin thickness,
insulin level, glucose level, diabetes pedigree function, and pregnancies. The target variable is a
binary indicator of whether the patient was diagnosed with diabetes. There are 768 instances in the data
set.

We first standardized the data by subtracting the mean of each variable and dividing by the standard
deviation. We then applied LDA to the standardized data set. The LDA plot showed that the data points
were well separated based on the target variable. The two classes were roughly centered around the
mean of the LDA variable, with little overlap between them. We also used the leave-one-out
cross-validation method to estimate the classification error rate of the LDA model. The estimated error rate
was 31%, which suggests that the LDA model has some predictive power, but it is not highly accurate.
We further examined the LDA coefficients to determine which variables were most important for
separating the two classes. The variables that had the largest absolute coefficients were body mass
index, glucose level, and age.
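
A leave-one-out error estimate and a coefficient ranking of this kind could be obtained roughly as shown below. This is a sketch under stated assumptions: the data are synthetic (200 rows, with feature 0 deliberately made informative), so the error rate and ranking it produces are not the ones reported above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic two-class data standing in for the real data set.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 8))
X[:, 0] += 1.5 * y  # only feature 0 carries class information

# Standardizing inside the pipeline keeps each held-out point out of the
# scaling statistics used to fit the model.
model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())

# Leave-one-out: each fold scores one point as correct (1.0) or wrong (0.0).
acc = cross_val_score(model, X, y, cv=LeaveOneOut())
loo_error = 1.0 - acc.mean()
print(f"LOO error rate: {loo_error:.3f}")

# Refit on all data and rank features by absolute LDA coefficient.
model.fit(X, y)
coefs = model.named_steps["lineardiscriminantanalysis"].coef_.ravel()
ranking = np.argsort(-np.abs(coefs))
print("features by |coefficient|:", ranking)
```

Because the features are standardized before fitting, the absolute coefficients are on a comparable scale, which is what makes ranking them meaningful.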

Hence, LDA analysis showed that the diabetes data set can be effectively separated into two classes
based on the target variable. The analysis also revealed that body mass index, glucose level, and age
are the most important variables for predicting the target variable. However, the LDA model is not highly
accurate, and more sophisticated modeling techniques may be required to predict the target variable
accurately from the other variables in the data set.

Report 3: PCA Analysis of Diabetes Data Set

In this analysis, we first standardize the data using the StandardScaler class from the
sklearn.preprocessing module. Then, we apply PCA to the standardized data, and plot the variance
explained by each principal component.

We then plot the scores of the first two principal components against each other, and color-code the
points according to the diagnosis of diabetes. The first principal component explains 26.08% of the
variance, the second principal component explains 16.29% of the variance, and the third principal
component explains 12.41% of the variance. The scree plot shows that the first three principal
components explain approximately 54.78% of the total variance in the data. We observe that the patients
who developed diabetes are scattered throughout the plot, and there is no clear separation between
the two groups. This indicates that it may be difficult to use PCA to distinguish between patients who
will or will not develop diabetes.

Report 4: LDA Analysis of Diabetes Data Set

In this analysis, we first split the data into a training set and a test set. We use the training set to fit an
LDA model, and then evaluate the performance of the model on the test set.

The LDA model correctly classifies 75.13% of the patients in the test set. The confusion matrix shows
that the model tends to predict that patients will not develop diabetes. This may be because the dataset
is imbalanced, with only 35% of the patients developing diabetes. To address this issue, we can use
techniques such as oversampling or undersampling to balance the dataset. We can also plot the
decision boundaries of the LDA model, which show the regions of the feature space that correspond to
each class. The decision boundaries are linear by construction, since LDA fits a linear discriminant.
There is still some overlap between the two classes, which may make
it difficult to achieve high classification accuracy.
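
The train/test evaluation described above might look like the following sketch. The split ratio (25% test), the stratified split, and the random seeds are assumptions, since the report does not state them, and the data are synthetic with roughly 35% positives, so the accuracy printed here is not the reported 75.13%.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: synthetic imbalanced data (about 35% positive class).
rng = np.random.default_rng(1)
n = 768
y = (rng.random(n) < 0.35).astype(int)
X = rng.normal(size=(n, 8))
X[:, 0] += 1.2 * y  # shift one feature for the positive class

# Assumption: 75/25 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
y_pred = lda.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes, in order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f"test accuracy: {acc:.4f}")
print(cm)
```

Reading the confusion matrix row by row shows whether errors are concentrated in the minority class: a large count in the bottom-left cell means many patients with diabetes were predicted as not having it.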

Conclusion: In this analysis, we used PCA and LDA to explore the relationships
between the features and the diagnosis of diabetes in the diabetes dataset. We
found that PCA was not effective at distinguishing between patients who will or
will not develop diabetes, while LDA was able to achieve a classification accuracy
of 75.13% on the test set. However, the dataset is imbalanced and the features
are not fully separable, which may limit the accuracy of the classification model.
Further research is needed to determine if other techniques can be used to
improve the classification accuracy of this dataset.
