Regression Analysis of Gapminder Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# pd.set_option('display.max_rows', None)
# use the local .tsv file and pd.read_csv to get the dataframe and view the first 5 rows
data.head()
In [2]:
# %matplotlib inline
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.grid()
plt.show()
Question 1: Is there a general trend (e.g., increasing or decreasing) for life expectancy across time? Is this trend linear?
(answering this qualitatively from the plot, you will do a statistical analysis of this question shortly)
We see that there is a general trend upwards in life expectancy(LE) as time goes on. In each decade, the lowest life expectancy (other
than the outliers in the late 1970s and in the 1990s) and the highest life expectancy goes up. So the range (top LE - lowest LE)
remains roughly the same, but in an absolute sense the life expectancy is going up.
In this scatter plot below (with all the data points), this is more clear since the white space under the points increases, indicating that
the y-values (the life expectancies) of the datapoints are going up as the years go by.
In [3]:
plt.scatter(data["year"], data["lifeExp"])
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.grid()
plt.show()
In [4]:
years = data["year"].unique()
# for each year in the data frame we'll make a separate list of Life Expectancy values and add them
# to the collections list of lists. We will then make a violin plot out of them.
collections = []
for year in years:
    # keep only the rows for this year, e.g. data[data.year == 2007]
    data_year = data[data.year == year]
    # now that data_year has rows where year is year, we will add to LEs all LE values.
    # this list will hold all the life expectancy(LE) values in the df where year is year
    LEs = data_year['lifeExp'].tolist()
    # now that LEs has all the LE values for when year is year, we will add this list to collections
    collections.append(LEs)
# now that collections has the data needed to make a violin plot, we will do that
# violin_plot = ax.violinplot(collections)
fig, ax = plt.subplots()
ax.violinplot(collections,years,widths=4,showmeans=True)
ax.set_xlabel("Year")
ax.set_ylabel("Life Expectancy")
As we can see from the violin plot above as well, life expectancy goes up as the years go by. The violins for each successive
year are more clustered at the top, indicating that a larger share of countries sits near the high end of the life
expectancy range for that year. The top and bottom of the violins also go up over time (so the window between the highest and
lowest LEs is moving up over time too).
Question 2: How would you describe the distribution of life expectancy across countries for individual years? Is it skewed, or not?
Unimodal or not? Symmetric around its center?
The distribution of LE across countries changes over the years from being skewed "down" to being symmetric to being skewed "up."
To be clear, in 1952 and 1957, most countries had LEs in the lower end. Then, in 1962, 1967, and 1972, the distribution appears more
symmetric in that there appear to be two modes, and most countries have LEs at either the low or the high end (either the 30-50 or
the 60-70 range).
Around 1977 we see that the data is beginning to be skewed "up," in that there are more countries that have LEs in the upper end of
the LEs for the year than countries that have the lower LEs for that year. As the years go by, this appears more pronounced. In 2007,
for example, it's clear that the majority of countries have LEs in the 60-80 range. Also after 1977, the distribution appears fairly
unimodal with most countries having LEs in the 60-80 range with the mode being around 70 (from a quick glance at the violin plot).
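One rough way to put a number on what the violin shapes suggest is to compare the mean and the median for a given year. When most values sit near the top with a tail reaching down (what the text calls skewed "up"), the mean falls below the median; the reverse holds for a tail reaching up. The sketch below uses made-up values rather than the Gapminder data, and `skew_direction` and its `tol` threshold are hypothetical names chosen for illustration.

```python
from statistics import mean, median

def skew_direction(values, tol=0.5):
    """Classify a sample by comparing its mean with its median."""
    diff = mean(values) - median(values)
    if diff < -tol:
        return "low tail"    # bulk of values near the top, tail toward low values
    if diff > tol:
        return "high tail"   # bulk of values near the bottom, tail toward high values
    return "roughly symmetric"

# Illustrative (made-up) life expectancies, not taken from the dataset.
mostly_high = [45, 62, 68, 70, 71, 72, 73, 74, 75, 76]
mostly_low = [30, 32, 33, 35, 36, 38, 40, 42, 55, 68]

print(skew_direction(mostly_high))
print(skew_direction(mostly_low))
```

This is only a coarse check; it says nothing about whether a distribution has one mode or two, which still has to be read off the plot.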
file:///Users/anudeepmetuku/Downloads/CMSC320 Projects/CMSC320_Project_3.html 4/41
3/18/22, 4:16 PM p3code
Question 3: Suppose I fit a linear regression model of life expectancy vs. year (treating it as a continuous variable), and test for a relationship between year and
In [6]:
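from sklearn import linear_model
# set x and y values to simply be the list of all year values (as they occur) and the lifeExp values,
# each reshaped into a single column so that coef_ comes back as a 2-D array
X = data["year"].to_numpy().reshape(-1, 1)
Y = data["lifeExp"].to_numpy().reshape(-1, 1)
reg = linear_model.LinearRegression()
reg.fit(X, Y)
reg.coef_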
Out[6]: array([[0.32590383]])
In [7]:
# extract the intercept and slope of the regression.
[intercept] = reg.intercept_.tolist()
[[slope]] = reg.coef_.tolist()
In [8]:
print("The equation is y = " + str(slope) + "* x + (" + str(intercept) + ")")
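I would reject the null hypothesis of no relationship: the fitted slope of approximately 0.33 years of life expectancy per calendar
year indicates a positive association between year and life expectancy (LE). Strictly speaking, the decision should rest on the
p-value of the slope rather than on its magnitude alone.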
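Whether a slope counts as evidence against the null hypothesis depends on its size relative to its standard error, not on its size alone. The following is a minimal sketch of that test using only the standard library; it runs on made-up data, the helper name `slope_test` is hypothetical, and the p-value uses a normal approximation in place of Student's t distribution.

```python
from math import sqrt
from statistics import NormalDist, mean

def slope_test(x, y):
    """Least-squares slope, its standard error, and an approximate two-sided p-value."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se = sqrt(sse / (n - 2) / sxx)
    z = slope / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, se, p

# Made-up data: a trend of 0.3 per year plus a fixed zig-zag "noise" pattern.
years = list(range(1952, 2008, 5))
noise = [1.5, -2.0, 0.5, -1.0, 2.0, -0.5, 1.0, -1.5, 0.0, 2.5, -2.5, 0.0]
life_exp = [50 + 0.3 * (yr - 1952) + e for yr, e in zip(years, noise)]

slope, se, p = slope_test(years, life_exp)
print(f"slope={slope:.3f}, se={se:.3f}, p={p:.2g}")
```

With real data, a library routine that reports exact t-based p-values would normally be preferred over this approximation.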
Exercise 3: Make a violin plot of residuals vs. year for the linear model from Exercise 2.
In [9]:
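# We will first calculate residuals (where Residual = Observed value - predicted value)
# and add a column in the data dataframe and then make a violin plot as we did
# in Exercise 1. But before doing that we'll add a column in the data dataframe corresponding to the predicted
# value from the fitted line (the names match the predLifeExp and residual columns shown in later output)
data["predLifeExp"] = slope * data["year"] + intercept
data["residual"] = data["lifeExp"] - data["predLifeExp"]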
In [10]:
data.head(100)
# for each year in the data frame we'll make a separate list of residual values and add them
# to the residual_collections list of lists. We will then make a violin plot out of them.
residual_collections = []
for year in years:
    data_year = data[data.year == year]
    # now that data_year has rows where year is year, we will add to residuals all residual values.
    residuals = data_year['residual'].tolist()
    residual_collections.append(residuals)
# now that residual_collections has the data needed to make a violin plot, we will do that
# violin_plot = ax.violinplot(collections)
fig, ax = plt.subplots()
ax.violinplot(residual_collections,years,widths=4,showmeans=True)
ax.set_xlabel("Year")
ax.set_ylabel("Residual")
violins and data spread closer around 0 for the later years, and this appears to be true in the violin plot.
And as I expected, the shape of the residuals' violins is similar to the shape of the violins of the life expectancies.
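To make the residual calculation concrete, here is a small standard-library sketch on made-up (year, life expectancy) pairs: it fits a least-squares line, computes residual = observed - predicted, and groups the residuals by year, which is the per-year list-of-lists shape a violin plot expects. The helper name `fit_line` and the data are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up observations: (year, life expectancy) for a few fictional countries.
rows = [(1952, 40.0), (1952, 60.0), (1977, 55.0),
        (1977, 70.0), (2002, 62.0), (2002, 79.0)]
years = [r[0] for r in rows]
values = [r[1] for r in rows]

slope, intercept = fit_line(years, values)

# residual = observed value - predicted value, collected per year
residuals_by_year = defaultdict(list)
for year, observed in rows:
    residuals_by_year[year].append(observed - (intercept + slope * year))

for year in sorted(residuals_by_year):
    print(year, [round(r, 2) for r in residuals_by_year[year]])
```

A useful sanity check is that least-squares residuals from a model with an intercept sum to zero overall, even though the residuals within a single year need not.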
Exercise 4: Make a boxplot (or violin plot) of model residuals vs. continent.
In [12]:
# Again, we will do the same thing as before to get the violin plots, this time replacing the year in what we d
# in exercise 3 with the continent.
In [13]:
continents = data["continent"].unique()
# for each continent in the data frame we'll make a separate list of residual values and add them
# to the residual_continent_collections list of lists. We will then make a violin plot out of them.
residual_continent_collections = []
for continent in continents:
    data_continent = data[data.continent == continent]
    # now that data_continent has rows where continent is continent, we will add to residuals all residual valu
    residuals = data_continent['residual'].tolist()
    residual_continent_collections.append(residuals)
fig, ax = plt.subplots()
violin_plot = ax.violinplot(residual_continent_collections)
# ax.violinplot(residual_continent_collections,continents,widths=4,showmeans=True)
# violinplot places the violins at positions 1..n by default, so label those ticks with the continent names
ax.set_xticks(range(1, len(continents) + 1))
ax.set_xticklabels(continents)
ax.set_xlabel("Continent")
ax.set_ylabel("Residual")
Question 9: Is there a dependence between model residual and continent? If so, what would that suggest when performing a
regression analysis of life expectancy across time?
There seems to be a dependence between model residual and continent. Oceania has a more consistent set of values (a relatively
small range) for its residuals although the magnitude of them is rather large (the model gives you values 10-20 from the measured
values). Europe is similar in this regard though its range of residuals is a bit larger, and this may be because of Europe and Oceania
having fewer countries and therefore more "consistent" values for its LEs and therefore the residuals. But Asia, Africa, and the
Americas being the largest continents with the most countries, probably have more variance in their LEs and have larger magnitudes
and ranges of residuals as a result.
So continent should be accounted for when doing regression analysis across time, for example by including it as a term in the model.
Asia, the Americas, and Africa will have a large effect on the average LE across continents, and a single pooled line may not give
the most descriptive picture, since LE may be changing across time differently for each continent.
In [14]:
# We will iterate through each continent and then do what we did in Exercise 1 (violins)
# and in Exercise 2 (regression) to look at violin plots by continent for LE across time
years = data["year"].unique()
for continent in continents:
    data_continent = data[data.continent == continent]
    # for each continent, filter by year and add those LEs to an array and use the array of arrays for the years
    collections = []
    for year in years:
        data_continent_year = data_continent[data_continent.year == year]
        LEs = data_continent_year['lifeExp'].tolist()
        collections.append(LEs)
    fig, ax = plt.subplots()
    ax.violinplot(collections,years,widths=4,showmeans=True)
    ax.set_xlabel("Year")
    ax.set_ylabel("Life Expectancy")
    ax.set_title("Violin Plot of Life Expectancy across time for the continent " + continent)
Exercise 5: As in the Moneyball project, make a scatter plot of life expectancy vs. year, grouped by continent, and add a
regression line. The result here can be given as either one scatter plot per continent, each with its own regression line, or
a single plot with each continent's points plotted in a different color, and one regression
line per continent's points. The former is probably easier to code up.
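In [15]:
# for each continent, get the years and the lifeexps and make a linear regression out of it as in Exercise 2.
# And make a plot and add the line to the plot as in Project 2
for continent in continents:
    data_continent = data[data.continent == continent]
    years = data_continent["year"].tolist()
    lifeExps = data_continent["lifeExp"].tolist()
    # reshape into single columns, as in Exercise 2
    X = np.array(years).reshape(-1, 1)
    Y = np.array(lifeExps).reshape(-1, 1)
    reg = linear_model.LinearRegression()
    reg.fit(X, Y)
    [intercept] = reg.intercept_.tolist()
    [[slope]] = reg.coef_.tolist()
    years = np.array(years)
    lifeExps = np.array(lifeExps)
    plt.figure(figsize=(8,4))
    plt.scatter(years, lifeExps)
    plt.xlabel("Year")
    plt.ylabel("Life Expectancy")
    plt.title(continent)
    m, b = np.polyfit(years, lifeExps, 1)
    # draw the fitted regression line over the scatter points
    plt.plot(years, m * years + b)
    plt.show()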
In [17]:
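import statsmodels.formula.api as smf
mod = smf.ols(formula='lifeExp ~ year * continent', data=data)
# fit the model; reg now holds the fitted results used by summary() and predict()
reg = mod.fit()
print(reg.summary())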
[OLS regression results; the coefficient table was not recovered]
Df Model: 9
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.09e+06. This might indicate that there are
Question 11: Are all parameters in the model significantly (in the p-value sense) different from zero? If not, which are not
significantly different from zero? Other libraries (statsmodels or patsy may help you solve this problem)
No. At a significance level of 0.1, continent[T.Oceania] and year:continent[T.Oceania] have p-values of 0.287 and 0.360 respectively,
which are above the threshold, so these two parameters are not significantly different from zero. The rest of the parameters are significantly different from 0.
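The decision rule is that a parameter is significantly different from zero when its p-value is below the chosen significance level, and not significantly different otherwise. A minimal sketch of that rule follows; `split_by_significance` is a hypothetical helper, and the dictionary holds only the two p-values quoted in the text, since the rest of the summary table is not reproduced here.

```python
def split_by_significance(p_values, alpha):
    """Partition parameter names by whether their p-value is below alpha."""
    significant = [name for name, p in p_values.items() if p < alpha]
    not_significant = [name for name, p in p_values.items() if p >= alpha]
    return significant, not_significant

# Only the two p-values quoted in the text; other parameters are omitted.
quoted = {
    "continent[T.Oceania]": 0.287,
    "year:continent[T.Oceania]": 0.360,
}

significant, not_significant = split_by_significance(quoted, alpha=0.1)
print("significant:", significant)
print("not significant:", not_significant)
```

With a fitted statsmodels result, the same comparison could be applied to its `pvalues` attribute instead of a hand-written dictionary.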
Question 12: On average, by how much does life expectancy increase each year
Intercept -524.257846
continent[T.Americas] -138.848447
continent[T.Asia] -312.633049
continent[T.Europe] 156.846852
continent[T.Oceania] 182.349883
year 0.289529
year:continent[T.Americas] 0.078122
year:continent[T.Asia] 0.163593
year:continent[T.Europe] -0.067597
year:continent[T.Oceania] -0.079257
dtype: float64
Africa is the reference level here, so the year coefficient (0.290) is Africa's slope, and each year:continent term is the difference
between that continent's slope and Africa's. For Africa, Americas, Asia, Europe, and Oceania respectively, life expectancy therefore
increases by about 0.290, 0.368 (0.290 + 0.078), 0.453 (0.290 + 0.164), 0.222 (0.290 - 0.068), and 0.210 (0.290 - 0.079) years, on
average, per year in this model. Adding these five slopes and dividing by 5, we get an unweighted average increase of about 0.309 years of LE per year.
Exercise 7: Make a residuals vs. year violin plot for the interaction model.
Comment on how well it matches assumptions of the linear regression model.
In [19]:
# First, use the dataframe column creation to create a new column predLifeExpInt to associate with the model th
# uses the interaction and create an associated residual column by subtracting the predictede from measured val
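predLifeExpInt = reg.predict(data)
data["predLifeExpInt"] = predLifeExpInt
# residual of the interaction model = measured value - predicted value
data["residualInt"] = data["lifeExp"] - data["predLifeExpInt"]
data.head(100)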
Out[19]: country continent year lifeExp pop gdpPercap predLifeExp residual predLifeExpInt residualInt
0 Afghanistan Asia 1952 28.801 8425333 779.445314 50.512084 -21.711084 47.604037 -18.803037
1 Afghanistan Asia 1957 30.332 9240934 820.853030 52.141603 -21.809603 49.869649 -19.537649
2 Afghanistan Asia 1962 31.997 10267083 853.100710 53.771122 -21.774122 52.135261 -20.138261
3 Afghanistan Asia 1967 34.020 11537966 836.197138 55.400642 -21.380642 54.400873 -20.380873
4 Afghanistan Asia 1972 36.088 13079460 739.981106 57.030161 -20.942161 56.666485 -20.578485
... ... ... ... ... ... ... ... ... ... ...
95 Bahrain Asia 2007 75.635 708573 29796.048340 68.436795 7.198205 72.525769 3.109231
96 Bangladesh Asia 1952 37.484 46886859 684.244172 50.512084 -13.028084 47.604037 -10.120037
97 Bangladesh Asia 1957 39.348 51365468 661.637458 52.141603 -12.793603 49.869649 -10.521649
98 Bangladesh Asia 1962 41.216 56839289 686.341554 53.771122 -12.555122 52.135261 -10.919261
99 Bangladesh Asia 1967 43.453 62821884 721.186086 55.400642 -11.947642 54.400873 -10.947873
100 rows × 10 columns
In [20]:
# As in Exercise 3, we'll make a residual vs time for the interaction model
# for each year in the data frame we'll make a separate list of residual values and add them
# to the residual_collections list of lists. We will then make a violin plot out of them.
residual_collections = []
years = data["year"].unique()
for year in years:
    data_year = data[data.year == year]
    # now that data_year has rows where year is year, we will add to residuals all residual values.
    residuals = data_year['residualInt'].tolist()
    residual_collections.append(residuals)
# now that residual_collections has the data needed to make a violin plot, we will do that
# violin_plot = ax.violinplot(collections)
fig, ax = plt.subplots()
ax.violinplot(residual_collections,years,widths=4,showmeans=True)
ax.set_xlabel("Year")
ax.set_ylabel("Residual")
ax.set_title("Violin Plot of Residuals of Interaction Model across time")
Out[20]: Text(0.5, 1.0, 'Violin Plot of Residuals of Interaction Model across time')
The residuals broadly match the assumptions of the linear regression model in that they are roughly centered around 0 in every year.
This makes sense because we expect most residuals to be close to 0, with values far above or below 0 being less frequent. The residuals
seem to be about normally distributed for each year. Judging from the differing shapes of the violins, the residuals for each year also
appear to be independent, although a violin plot alone cannot confirm this.
Part 2 : Classification
We'll look at the Iris dataset from scikit-learn (https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset). From the
website:
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information: sepal length in cm , sepal width in cm , petal length in cm , petal width in cm
class: Iris-Setosa , Iris-Versicolour , Iris-Virginica
So the attributes give the length and width of each Iris flower's petals and sepals, alongside the classification.
In [21]:
import seaborn as sns
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
data = load_iris()
# feature_names returns the 4 attribute names as a list which we'll use as column names in the dataframe
cols = data.feature_names
# Make a dataframe out of the attributes and we'll then add in the classification as a new column
df = pd.DataFrame(X, columns=cols)
df["classif"] = y
df
Out[21]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
150 rows × 5 columns
In [22]:
df_0 = df[df.classif == 0]
df_0.describe().head(3)
Out[22]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
count 50.00000 50.000000 50.000000 50.000000 50.0
mean 5.00600 3.428000 1.462000 0.246000 0.0
std 0.35249 0.379064 0.173664 0.105386 0.0
In [23]:
df_1 = df[df.classif == 1]
df_1.describe().head(3)
Out[23]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
count 50.000000 50.000000 50.000000 50.000000 50.0
mean 5.936000 2.770000 4.260000 1.326000 1.0
std 0.516171 0.313798 0.469911 0.197753 0.0
In [24]:
df_2 = df[df.classif == 2]
df_2.describe().head(3)
Out[24]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
count 50.00000 50.000000 50.000000 50.00000 50.0
mean 6.58800 2.974000 5.552000 2.02600 2.0
std 0.63588 0.322497 0.551895 0.27465 0.0
We see that the means of the latter 2 attributes (petal length and petal width) differ considerably across the three classes
relative to their standard deviations. Most of the data (falling within 1 or 2 standard deviations of the mean) is not going to
overlap between classes for these 2 attributes, judging from a quick glance at the mean and standard deviation of each. So this
looks like a decent fit for Linear Discriminant Analysis (LDA), our first algorithm.
Method 1 : Linear Discriminant Analysis (LDA)
We'll use holdout validation. First, we will split the data into training and testing data, with the training data about twice as
large as the test data. We will use the df_0, df_1, df_2 dataframes, which filter by classification, and randomly split each of them
so that every class is represented in both sets. We then join the first parts into the training data and the remaining parts into
the test data (99 training rows and 51 test rows in the output that follows).
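The per-class split described here is a stratified holdout split: each class is shuffled and cut separately so that both sets keep the same class proportions. A minimal standard-library sketch follows, working on row indices rather than dataframes; the function name `stratified_split`, the `seed` argument, and the 0.66 training fraction are assumptions made for illustration.

```python
import random

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]          # copy so the grouping itself is not reordered
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# 150 labels with 50 per class, the same shape as the Iris target.
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels, train_fraction=0.66)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which helps when comparing two classifiers on identical training and test rows.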
In [25]:
# We'll split by first shuffling the rows for each dataframe associated with a class and then cutting each one
# then we'll add to the train and test dfs below (the dfs of which we'll concatenate at the end)
dfs = [df_0, df_1, df_2]
train_dfs = []
test_dfs = []
# training fraction: an assumed value that reproduces the 33/17 per-class split shown in the output
split = 0.66
# for each of the dfs associated with a class, we'll shuffle the rows, cut at the split point, put the
# first part in the training set and the rest in the test set, and then concatenate the dfs in each set
for df in dfs:
    df_shuffled = df.sample(frac=1)
    mid = int(len(df_shuffled) * split)
    df_train = df_shuffled.iloc[: mid]
    df_test = df_shuffled.iloc[mid :]
    train_dfs.append(df_train)
    test_dfs.append(df_test)
train_df = pd.concat(train_dfs)
test_df = pd.concat(test_dfs)
# pd.set_option('display.max_rows', None)
train_df
Out[25]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
34 4.9 3.1 1.5 0.2 0
36 5.5 3.5 1.3 0.2 0
1 4.9 3.0 1.4 0.2 0
38 4.4 3.0 1.3 0.2 0
8 4.4 2.9 1.4 0.2 0
... ... ... ... ... ...
140 6.7 3.1 5.6 2.4 2
144 6.7 3.3 5.7 2.5 2
126 6.2 2.8 4.8 1.8 2
127 6.1 3.0 4.9 1.8 2
128 6.4 2.8 5.6 2.1 2
99 rows × 5 columns
In [26]:
test_df
Out[26]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
14 5.8 4.0 1.2 0.2 0
43 5.0 3.5 1.6 0.6 0
7 5.0 3.4 1.5 0.2 0
31 5.4 3.4 1.5 0.4 0
2 4.7 3.2 1.3 0.2 0
5 5.4 3.9 1.7 0.4 0
22 4.6 3.6 1.0 0.2 0
42 4.4 3.2 1.3 0.2 0
37 4.9 3.6 1.4 0.1 0
6 4.6 3.4 1.4 0.3 0
4 5.0 3.6 1.4 0.2 0
30 4.8 3.1 1.6 0.2 0
33 5.5 4.2 1.4 0.2 0
45 4.8 3.0 1.4 0.3 0
11 4.8 3.4 1.6 0.2 0
32 5.2 4.1 1.5 0.1 0
48 5.3 3.7 1.5 0.2 0
64 5.6 2.9 3.6 1.3 1
93 5.0 2.3 3.3 1.0 1
57 4.9 2.4 3.3 1.0 1
81 5.5 2.4 3.7 1.0 1
52 6.9 3.1 4.9 1.5 1
55 5.7 2.8 4.5 1.3 1
72 6.3 2.5 4.9 1.5 1
92 5.8 2.6 4.0 1.2 1
87 6.3 2.3 4.4 1.3 1
56 6.3 3.3 4.7 1.6 1
54 6.5 2.8 4.6 1.5 1
80 5.5 2.4 3.8 1.1 1
83 6.0 2.7 5.1 1.6 1
95 5.7 3.0 4.2 1.2 1
61 5.9 3.0 4.2 1.5 1
82 5.8 2.7 3.9 1.2 1
98 5.1 2.5 3.0 1.1 1
114 5.8 2.8 5.1 2.4 2
143 6.8 3.2 5.9 2.3 2
107 7.3 2.9 6.3 1.8 2
131 7.9 3.8 6.4 2.0 2
102 7.1 3.0 5.9 2.1 2
105 7.6 3.0 6.6 2.1 2
122 7.7 2.8 6.7 2.0 2
142 5.8 2.7 5.1 1.9 2
137 6.4 3.1 5.5 1.8 2
106 4.9 2.5 4.5 1.7 2
104 6.5 3.0 5.8 2.2 2
130 7.4 2.8 6.1 1.9 2
133 6.3 2.8 5.1 1.5 2
145 6.7 3.0 5.2 2.3 2
111 6.4 2.7 5.3 1.9 2
132 6.4 2.8 5.6 2.2 2
148 6.2 3.4 5.4 2.3 2
Now, we will fit LDA on the training dataframe.
In [27]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# separate the training df into attributes and classification and convert to numpy to use LDA
X_df = train_df.drop(['classif'], axis=1)
X = X_df.to_numpy()
y = np.array(train_df["classif"].tolist())
clf_lda = LinearDiscriminantAnalysis()
clf_lda.fit(X, y)
Out[27]: LinearDiscriminantAnalysis()
Now that we have a fit, we'll predict values using our test_df
In [28]:
test_attributes = test_df.drop(['classif'], axis=1).to_numpy().tolist()
correct_vals = test_df["classif"].tolist()
predictions = []
# for each set of attributes, make a prediction and add it to list of predictions which is added to the test df
for attributes in test_attributes :
[prediction] = clf_lda.predict([attributes])
predictions.append(prediction)
test_df["prediction_LDA"] = predictions
With the predictions and the classifications, we can look at the accuracy of the classifier. We'll define accuracy as
$$\text{accuracy} = \frac{\text{correct predictions}}{\text{total predictions}}$$
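As a small worked instance of this definition, the sketch below counts matching labels and divides by the total. The function name `accuracy` and the label lists are illustrative; the lists are constructed to contain 51 items with two mismatches, not copied from the test dataframe.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that equal the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# 51 made-up labels (17 per class) with exactly two wrong predictions.
actual = [0] * 17 + [1] * 17 + [2] * 17
predicted = actual[:]
predicted[20] = 2   # a class-1 item predicted as class 2
predicted[40] = 1   # a class-2 item predicted as class 1

print(round(accuracy(predicted, actual), 4))
```

Two misses out of 51 correspond to 49/51, which is where a figure of about 0.96 comes from.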
In [29]:
predictions = test_df["prediction_LDA"].tolist()
classifications = test_df["classif"].tolist()
correctness = []
num_correct = 0
# for each pair of measured class and the predicted class, check if they're equal and make a list of true/false
# based on correctness to be added to the df
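for i in range(len(test_df)):
    check = predictions[i] == classifications[i]
    correctness.append(check)
    if (check == True):
        num_correct += 1
test_df["correctness_LDA"] = correctness
accuracy = num_correct / len(predictions)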
In [30]:
print("Accuracy of LDA is " + str(accuracy))
test_df
Out[30]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif prediction_LDA correctness_LDA
14 5.8 4.0 1.2 0.2 0 0 True
43 5.0 3.5 1.6 0.6 0 0 True
7 5.0 3.4 1.5 0.2 0 0 True
31 5.4 3.4 1.5 0.4 0 0 True
2 4.7 3.2 1.3 0.2 0 0 True
5 5.4 3.9 1.7 0.4 0 0 True
22 4.6 3.6 1.0 0.2 0 0 True
42 4.4 3.2 1.3 0.2 0 0 True
37 4.9 3.6 1.4 0.1 0 0 True
6 4.6 3.4 1.4 0.3 0 0 True
4 5.0 3.6 1.4 0.2 0 0 True
30 4.8 3.1 1.6 0.2 0 0 True
33 5.5 4.2 1.4 0.2 0 0 True
45 4.8 3.0 1.4 0.3 0 0 True
11 4.8 3.4 1.6 0.2 0 0 True
32 5.2 4.1 1.5 0.1 0 0 True
48 5.3 3.7 1.5 0.2 0 0 True
64 5.6 2.9 3.6 1.3 1 1 True
93 5.0 2.3 3.3 1.0 1 1 True
57 4.9 2.4 3.3 1.0 1 1 True
81 5.5 2.4 3.7 1.0 1 1 True
52 6.9 3.1 4.9 1.5 1 1 True
55 5.7 2.8 4.5 1.3 1 1 True
72 6.3 2.5 4.9 1.5 1 1 True
92 5.8 2.6 4.0 1.2 1 1 True
87 6.3 2.3 4.4 1.3 1 1 True
56 6.3 3.3 4.7 1.6 1 1 True
54 6.5 2.8 4.6 1.5 1 1 True
80 5.5 2.4 3.8 1.1 1 1 True
83 6.0 2.7 5.1 1.6 1 2 False
95 5.7 3.0 4.2 1.2 1 1 True
61 5.9 3.0 4.2 1.5 1 1 True
82 5.8 2.7 3.9 1.2 1 1 True
98 5.1 2.5 3.0 1.1 1 1 True
114 5.8 2.8 5.1 2.4 2 2 True
143 6.8 3.2 5.9 2.3 2 2 True
107 7.3 2.9 6.3 1.8 2 2 True
131 7.9 3.8 6.4 2.0 2 2 True
102 7.1 3.0 5.9 2.1 2 2 True
105 7.6 3.0 6.6 2.1 2 2 True
122 7.7 2.8 6.7 2.0 2 2 True
142 5.8 2.7 5.1 1.9 2 2 True
137 6.4 3.1 5.5 1.8 2 2 True
106 4.9 2.5 4.5 1.7 2 2 True
104 6.5 3.0 5.8 2.2 2 2 True
130 7.4 2.8 6.1 1.9 2 2 True
133 6.3 2.8 5.1 1.5 2 1 False
145 6.7 3.0 5.2 2.3 2 2 True
111 6.4 2.7 5.3 1.9 2 2 True
132 6.4 2.8 5.6 2.2 2 2 True
148 6.2 3.4 5.4 2.3 2 2 True
The accuracy for the LDA is about 0.96.
Method 2 : Decision Trees
We will use the same data for a fair comparison, so we'll just use the 4 attribute columns from the split we made in the LDA section
above. The default hyperparameters will be used.
In [31]:
from sklearn import tree
clf_tree = tree.DecisionTreeClassifier()
clf_tree = clf_tree.fit(X, y)
# similar to the LDA, we'll get the predictions and then add them to the dataframe and check for correctness
predictions_tree = []
for attributes in test_attributes:
    [prediction] = clf_tree.predict([attributes])
    predictions_tree.append(prediction)
test_df["prediction_tree"] = predictions_tree
In [32]:
correctness_tree = []
num_correct_tree = 0
# like for LDA, we'll compare each correct value in the test set to the predicted value and get the accuracy
for i in range(len(test_df)):
    check = predictions_tree[i] == correct_vals[i]
    correctness_tree.append(check)
    if (check == True):
        num_correct_tree += 1
test_df["correctness_tree"] = correctness_tree
accuracy_tree = num_correct_tree / len(predictions_tree)
test_df
splits
Out[33]: array([0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11, 0.12, 0.13, 0.14, 0.15,
0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22, 0.23, 0.24, 0.25, 0.26,
0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37,
0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48,
0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59,
0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ,
0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 , 0.81,
0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91, 0.92,
0.93, 0.94])
We'll make a function that returns the test_df at the end of doing both methods and also the accuracies of each (it includes all the
work done above). The function takes the split used and reloads the Iris data itself on each call.
In [34]:
# NOTE: the original def line was lost; the name run_classifiers and its signature are assumed
def run_classifiers(split):
    X,y = load_iris(return_X_y = True)
    data = load_iris()
    # feature_names returns the 4 attribute names as a list which we'll use as column names in the dataframe
    cols = data.feature_names
    # Make a dataframe out of the attributes and we'll then add in the classification as a new column
    df = pd.DataFrame(X, columns=cols)
    df["classif"] = y
    df_0 = df[df.classif == 0]
    df_1 = df[df.classif == 1]
    df_2 = df[df.classif == 2]
    dfs = [df_0, df_1, df_2]
    train_dfs = []
    test_dfs = []
    # for each of the dfs associated with a class, we'll shuffle the rows, cut at the split point, put the
    # first part in the training set and the rest in the test set, and then concatenate the dfs in each set
    for df in dfs:
        df_shuffled = df.sample(frac=1)
        mid = int(len(df_shuffled) * split)
        df_train = df_shuffled.iloc[: mid]
        df_test = df_shuffled.iloc[mid :]
        train_dfs.append(df_train)
        test_dfs.append(df_test)
    train_df = pd.concat(train_dfs)
    test_df = pd.concat(test_dfs)
    # Do LDA
    X_df = train_df.drop(['classif'], axis=1)
    X = X_df.to_numpy()
    y = np.array(train_df["classif"].tolist())
    clf_lda = LinearDiscriminantAnalysis()
    clf_lda.fit(X, y)
    # Make Predictions
    test_attributes = test_df.drop(['classif'], axis=1).to_numpy().tolist()
    correct_vals = test_df["classif"].tolist()
    predictions = []
    for attributes in test_attributes:
        [prediction] = clf_lda.predict([attributes])
        predictions.append(prediction)
    test_df["prediction_LDA"] = predictions
    classifications = test_df["classif"].tolist()
    correctness = []
    num_correct = 0
    for i in range(len(test_df)):
        check = predictions[i] == classifications[i]
        correctness.append(check)
        if (check == True):
            num_correct += 1
    test_df["correctness_LDA"] = correctness
    accuracy = num_correct / len(predictions)
    # DECISION TREES
    clf_tree = tree.DecisionTreeClassifier()
    clf_tree = clf_tree.fit(X, y)
    # similar to the LDA, we'll get the predictions and then add them to the dataframe and check for correctness
    predictions_tree = []
    for attributes in test_attributes:
        [prediction] = clf_tree.predict([attributes])
        predictions_tree.append(prediction)
    test_df["prediction_tree"] = predictions_tree
    correctness_tree = []
    num_correct_tree = 0
    for i in range(len(test_df)):
        check = predictions_tree[i] == correct_vals[i]
        correctness_tree.append(check)
        if (check == True):
            num_correct_tree += 1
    test_df["correctness_tree"] = correctness_tree
    accuracy_tree = num_correct_tree / len(predictions_tree)
    ret = (test_df, accuracy, accuracy_tree)
    return ret
In [36]:
new_dfs = []
accs = []
accs_tree = []
for split in splits:
    new_df, acc, acc_tree = run_classifiers(split)
    new_dfs.append(new_df)
    accs.append(acc)
    accs_tree.append(acc_tree)
# make an array of arrays (where it's an array of rows of a df) to be fed into a df
X_comparison = []
for i in range(len(accs)):
    # assumed row layout: split, LDA accuracy, tree accuracy
    triple = [splits[i], accs[i], accs_tree[i]]
    X_comparison.append(triple)
df_comparison = pd.DataFrame(X_comparison)
pd.set_option('display.max_rows', None)
df_comparison
plt.figure(figsize=(8,4))
plt.scatter(splits, accs)
plt.grid()
plt.xlabel("Split")
plt.ylabel("LDA Accuracy")
plt.show()
In [39]:
# Trees
plt.figure(figsize=(8,4))
plt.scatter(splits, accs_tree)
plt.grid()
plt.xlabel("Split")
plt.ylabel("Decision Tree Accuracy")
plt.show()
We see that for LDA, using a relatively small share (about 0.3 to 0.5) of the 150 observations for training gives the most accurate
predictions. For trees, however, apart from the overfit splits above 0.90, either a smaller split (close to 0.2) or a fairly large
one (close to 0.8) seems to give higher accuracy. An even split seems to give the least accurate results.
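When reading accuracy against split, it helps to keep in mind how many test rows remain at each split, because a small test set makes accuracy move in coarse steps. The sketch below assumes the per-class cut is `int(50 * split)` for each of the three classes, which is an assumption about the splitting code rather than something confirmed by the output; `holdout_size` is a hypothetical helper name.

```python
def holdout_size(split, per_class=50, n_classes=3):
    """Number of test rows left when each class sends int(per_class * split) rows to training."""
    train_per_class = int(per_class * split)
    return (per_class - train_per_class) * n_classes

for split in (0.2, 0.5, 0.8, 0.9):
    n_test = holdout_size(split)
    step = 1 / n_test   # accuracy change caused by a single misclassified row
    print(f"split={split}: {n_test} test rows, one miss costs {step:.3f} accuracy")
```

At the largest splits a single misclassification shifts accuracy by several percentage points, so some of the scatter at that end of the plots may reflect test-set size rather than model quality. Repeating each split several times and averaging would give a steadier comparison.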