Part 1: Regression analysis of Gapminder data


Exercise 1: Make a scatter plot of life expectancy across time.
In [1]:
import sqlite3

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

# pd.set_option('display.max_rows', None)

# use the local .tsv file and pd.read_csv to get the dataframe, and view the first 5 rows

data = pd.read_csv("08_gap-every-five-years.tsv", sep='\t')

data.head()

Out[1]: country continent year lifeExp pop gdpPercap


0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

In [2]:
# %matplotlib inline

plt.plot(data["year"], data["lifeExp"], ".")

plt.xlabel("Year")

plt.ylabel("Life Expectancy")

plt.title("Life Expectancy across Time")

plt.grid()

plt.show()

Question 1: Is there a general trend (e.g., increasing or decreasing) for life expectancy across time? Is this trend linear? (answering this qualitatively from the plot, you will do a statistical analysis of this question shortly)
We see a general upward trend in life expectancy (LE) as time goes on. In each decade, both the lowest life expectancy (other than the outliers in the late 1970s and in the 1990s) and the highest life expectancy go up. So the range (highest LE - lowest LE) remains roughly the same, but in an absolute sense life expectancy is going up.
In the plot below (drawn with lines connecting all the data points), this is clearer, since the white space under the plotted values increases, indicating that the y-values (the life expectancies) of the data points are going up as the years go by.
In [3]:
plt.plot(data["year"], data["lifeExp"])

plt.xlabel("Year")

plt.ylabel("Life Expectancy")

plt.title("Life Expectancy across Time")

plt.grid()

plt.show()

In [4]:
years = data["year"].unique()

# for each year in the data frame we'll make a separate list of Life Expectancy values and add them

# to the collections list of lists. We will then make a violin plot out of them.

collections = []

# data[data.year == 2007]

for year in years:

    data_year = data[data.year == year]

    # now that data_year has rows where year is year, we will add to LEs all LE values.

    # this list will hold all the life expectancy (LE) values in the df where year is year

    LEs = data_year['lifeExp'].tolist()

    # now that LEs has all the LE values for when year is year, we will add this list to collections

    collections.append(LEs)

# now that collections has the data needed to make a violin plot, we will do that

# violin_plot = ax.violinplot(collections)

fig, ax = plt.subplots()

ax.violinplot(collections,years,widths=4,showmeans=True)

ax.set_xlabel("Year")

ax.set_ylabel("Life Expectancy")

ax.set_title("Violin Plot of Life Expectancy across time")

Out[4]: Text(0.5, 1.0, 'Violin Plot of Life Expectancy across time')

As the violin plot above also shows, life expectancy goes up as the years go by. The violins for each successive year are more clustered at the top, indicating that more countries in each year reach the high end of the life-expectancy range for that year. The tops and bottoms of the violins also rise over time (so the window between the highest and lowest LEs is shifting up over time too).
Question 2: How would you describe the distribution of life expectancy across countries for individual years? Is it skewed, or not? Unimodal or not? Symmetric around its center?
The distribution of LE across countries for each year changes from being skewed "down" to being roughly symmetric to being skewed "up." To be clear, in 1952 and 1957 most countries had LEs at the lower end. Then, in 1962, 1967, and 1972, the distribution appears more symmetric, though with two modes: most countries sit at either the low or high end of LEs (roughly the 30-50 or 60-70 range).
Around 1977 we see the data beginning to be skewed "up," in that more countries have LEs in the upper end of that year's range than in the lower end. As the years go by, this becomes more pronounced. In 2007, for example, it's clear that the majority of countries have LEs in the 60-80 range. Also, after 1977 the distribution appears fairly unimodal, with most countries having LEs in the 60-80 range and the mode around 70 (from a quick glance at the violin plot).
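One way to check these skewness claims numerically (a minimal sketch, reusing the data DataFrame loaded above) is pandas' per-group sample skewness, where positive values indicate a longer right tail and negative values a longer left tail:

In [ ]:
# sample skewness of life expectancy across countries, per year
print(data.groupby('year')['lifeExp'].skew())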
Question 3: Suppose I fit a linear regression model of life expectancy vs. year (treating it as a continuous variable), and test for a relationship between year and life expectancy, will you reject the null hypothesis of no relationship? (do this without fitting the model yet. I am testing your intuition.)
Looking at the violin plot, and considering that the violins' tops and bottoms rise over time and that the violins are wider at the top as the years go by, there seems to be a positive relationship, so a linear regression model will probably lead us to reject the null hypothesis of no relationship.
Question 4: What would a violin plot of residuals from the linear model in Question 3 vs. year look like? (Again, don't do the analysis yet, answer this intuitively)
The residuals are the differences between the measured and predicted values. A violin plot of residuals from the linear model in Q3 plotted against time would look similar to the violin plot of the life expectancies across the years.
For the later years, since the distribution is more unimodal and skewed "up" as described above, the violins of the residuals would look similar to the violins of the violin plot above. This is because the value predicted by the model would be close to the mode of each violin. Since the violins for the later years are more unimodal, the violins of the residuals for those years would look similar: smaller and more unimodal.
Since the violins of the earlier years are less unimodal, the violins of the residuals for those years would be less unimodal and contain differences of larger magnitude (i.e. no clear bulge associated with a mode), since the differences between the linear model and the actual values vary a lot. So the violins of the residuals for the early years would be larger and less unimodal.
Question 5: According to the assumptions of the linear regression model, what should that violin plot look like? That is, consider the assumptions the linear regression model you used assumes (e.g., about noise, about input distributions, etc.); do you think everything is okay?
If the linear regression model is specified correctly, the residual distribution should be symmetric about 0 for each violin (since you expect most measured values to differ from the model's prediction for a given year by about 0). As described in the answer to Question 4 above, you would also expect a larger spread around 0 for the earlier years' violins, since the LEs for those years are not as clustered as they are for the later years. That varying spread is in tension with the constant-noise-variance (homoscedasticity) assumption, so everything may not be okay.

Exercise 2: Fit a linear regression model using, e.g., the LinearRegression function from Scikit-Learn or the closed-form solution, for life expectancy vs. year (as a continuous variable). There is no need to plot anything here, but please print the fitted model out in a readable format.
In [5]:
from sklearn import linear_model

In [6]:
# set x and y values to simply be the list of all year values (as they occur) and the lifeExp values

# (as they occur in data)

X = [[xval] for xval in data["year"].tolist()]

Y = [[yval] for yval in data["lifeExp"].tolist()]

reg = linear_model.LinearRegression()

reg.fit(X, Y)

reg.coef_

Out[6]: array([[0.32590383]])

In [7]:
# extract the intercept and slope of the regression.

[intercept] = reg.intercept_.tolist()

[[slope]] = reg.coef_.tolist()

In [8]:
print("The equation is y = " + str(slope) + "* x + (" + str(intercept) + ")")

The equation is y = 0.3259038276371518* x + (-585.6521874415448)

The equation is y = 0.3259038276371518* x - 585.6521874415448 for the linear regression.


Question 6: On average, by how much does life expectancy increase every year around the world?
Based on the regression computed, LE increases on average by about 0.33 years per calendar year around the world.
Question 7: Do you reject the null hypothesis of no relationship between year and life expectancy? Why?
I would reject the null hypothesis because a slope of approximately 0.33 is substantial and indicates that there is likely a positive relationship between year and life expectancy (LE); the formal check is the p-value on the slope, sketched below.
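Here is a minimal sketch of that formal test, assuming the data DataFrame loaded above and using statsmodels (which is also imported in Exercise 6 below):

In [ ]:
import statsmodels.formula.api as smf

# p-value on the year coefficient; a value far below 0.05
# rejects the null hypothesis of no relationship
simple_mod = smf.ols(formula='lifeExp ~ year', data=data).fit()
print(simple_mod.pvalues['year'])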
Exercise 3: Make a violin plot of residuals vs. year for the linear model from Exercise 2.
In [9]:
# We will first calculate residuals (where Residual = Observed value – predicted value)

# and add a column in the data dataframe and then make a violin plot as we did

# in Exercise 1. But before doing that we'll add a column in the data dataframe corresponding to the predicted

# LE based on year from our regression

data["predLifeExp"] = data["year"] * slope + intercept

data["residual"] = data["lifeExp"] - data["predLifeExp"]

In [10]:
data.head(100)

Out[10]: country continent year lifeExp pop gdpPercap predLifeExp residual


0 Afghanistan Asia 1952 28.801 8425333 779.445314 50.512084 -21.711084
1 Afghanistan Asia 1957 30.332 9240934 820.853030 52.141603 -21.809603
2 Afghanistan Asia 1962 31.997 10267083 853.100710 53.771122 -21.774122
3 Afghanistan Asia 1967 34.020 11537966 836.197138 55.400642 -21.380642
4 Afghanistan Asia 1972 36.088 13079460 739.981106 57.030161 -20.942161
... ... ... ... ... ... ... ... ...
95 Bahrain Asia 2007 75.635 708573 29796.048340 68.436795 7.198205
96 Bangladesh Asia 1952 37.484 46886859 684.244172 50.512084 -13.028084
97 Bangladesh Asia 1957 39.348 51365468 661.637458 52.141603 -12.793603
98 Bangladesh Asia 1962 41.216 56839289 686.341554 53.771122 -12.555122
99 Bangladesh Asia 1967 43.453 62821884 721.186086 55.400642 -11.947642
100 rows × 8 columns
In [11]:
# Now we will make a violin plot of the residuals across each year

# for each year in the data frame we'll make a separate list of residual values and add them

# to the residual_collections list of lists. We will then make a violin plot out of them.

residual_collections = []

for year in years:

    data_year = data[data.year == year]

    # now that data_year has rows where year is year, we will add to residuals all residual values.

    residuals = data_year['residual'].tolist()

    residual_collections.append(residuals)

# now that residual_collections has the data needed to make a violin plot, we will do that

# violin_plot = ax.violinplot(collections)

fig, ax = plt.subplots()

ax.violinplot(residual_collections,years,widths=4,showmeans=True)

ax.set_xlabel("Year")

ax.set_ylabel("Residual")

ax.set_title("Violin Plot of Residuals across time")

Out[11]: Text(0.5, 1.0, 'Violin Plot of Residuals across time')

Question 8: Does the plot of Exercise 3 match your expectations (as you answered Question 4)?
It matches my expectations as described above in Question 4. I expected more data further from 0 for the earlier years' violins and data spread closer around 0 for the later years, and this appears to be true in the violin plot.
And as I expected, the shapes of the residuals' violins are similar to the shapes of the violins of the life expectancies.
Exercise 4: Make a boxplot (or violin plot) of model residuals vs. continent.
In [12]:
# Again, we will do the same thing as before to get the violin plots, this time replacing the year in what we did
# in Exercise 3 with the continent.

In [13]:
continents = data["continent"].unique()

# for each continent in the data frame we'll make a separate list of residual values and add them

# to the residual_continent_collections list of lists. We will then make a violin plot out of them.

residual_continent_collections = []

for continent in continents:

    data_continent = data[data.continent == continent]

    # now that data_continent has rows where continent is continent, we will add to residuals all residual values
    residuals = data_continent['residual'].tolist()

    residual_continent_collections.append(residuals)

fig, ax = plt.subplots()

violin_plot = ax.violinplot(residual_continent_collections)

# ax.violinplot(residual_continent_collections,continents,widths=4,showmeans=True)

# the violins are drawn at positions 1..n, so label those positions with the continent names

ax.set_xticks(range(1, len(continents) + 1))

ax.set_xticklabels(continents.tolist())

ax.set_xlabel("Continent")

ax.set_ylabel("Residual")

ax.set_title("Violin Plot of Residuals across continent")

Out[13]: Text(0.5, 1.0, 'Violin Plot of Residuals across continent')

Question 9: Is there a dependence between model residual and continent? If so, what would that suggest when performing a regression analysis of life expectancy across time?
There seems to be a dependence between model residual and continent. Oceania has a consistent set of residuals (a relatively small range), although their magnitude is rather large (the model's predictions are off by roughly 10-20 years from the measured values). Europe is similar in this regard, though its range of residuals is a bit larger. This may be because Europe and Oceania have fewer countries, and therefore more "consistent" LEs and residuals. But Asia, Africa, and the Americas, having many countries, probably have more variance in their LEs and so have residuals of larger magnitude and range.
So this information must be considered when doing regression analysis across time. Asia, the Americas, and Africa will have a large effect on the average LE across continents, and a pooled fit may not give the most descriptive picture, since LE may be changing across time differently for each continent.
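A quick numeric check of this dependence (a sketch, reusing the residual column computed in Exercise 3):

In [ ]:
# per-continent residual summaries; means far from 0 indicate the pooled
# model systematically over- or under-predicts for that continent
print(data.groupby('continent')['residual'].agg(['mean', 'std', 'min', 'max']))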
In [14]:
# We will iterate through each continent and then do what we did in Exercise 1 (violins)

# and in Exercise 2 (regression) to look at violin plots by continent for LE across time

# for each continent, filter by year and add those LEs to an array and use the array of arrays for the years

# to plot a violin plot for each continent.

for continent in continents:

    data_continent = data[data.continent == continent]

    collections = []

    for year in years:

        data_continent_year = data_continent[data_continent.year == year]

        LEs = data_continent_year['lifeExp'].tolist()

        collections.append(LEs)

    fig, ax = plt.subplots()

    ax.violinplot(collections,years,widths=4,showmeans=True)

    ax.set_xlabel("Year")

    ax.set_ylabel("Life Expectancy")

    ax.set_title("Violin Plot of Life Expectancy across time for the continent " + continent)

Exercise 5: As in the Moneyball project, make a scatter plot of life expectancy vs. year, grouped by continent, and add a regression line. The result here can be given as either one scatter plot per continent, each with its own regression line, or a single plot with each continent's points plotted in a different color, and one regression line per continent's points. The former is probably easier to code up.
In [15]: # for each continent, get the years and the lifeExps and make a linear regression out of it as in Exercise 2.

# And make a plot and add the line to the plot as in Project 2

for continent in continents:

    data_continent = data[data.continent == continent]

    # use a local name so we don't overwrite the global `years` (the unique years), which is reused later
    cont_years = data_continent["year"].tolist()

    lifeExps = data_continent["lifeExp"].tolist()

    X = [[xval] for xval in cont_years]

    Y = [[yval] for yval in lifeExps]

    reg = linear_model.LinearRegression()

    reg.fit(X, Y)

    [intercept] = reg.intercept_.tolist()

    [[slope]] = reg.coef_.tolist()

    cont_years = np.array(cont_years)

    lifeExps = np.array(lifeExps)

    plt.figure(figsize=(8,4))

    plt.scatter(cont_years, lifeExps)

    plt.title("Life Expectancy Across Time For " + continent)

    plt.grid()

    plt.xlabel("Year")

    plt.ylabel("Life Expectancy")

    m, b = np.polyfit(cont_years, lifeExps, 1)

    plt.plot(cont_years, float(m) * cont_years + float(b))

    plt.show()

Question 10: Based on this plot, should your regression model include an interaction term for continent and year? Why?
Yes. Clearly, the data and its distribution differ for each continent. For Oceania, for example, the data is almost linear, and there is very little difference in LE between its only two countries here (NZ and Australia) for any given year. Contrast this with, say, Asia, where there is a massive amount of variation in the data. Even though the trend lines for all continents have a positive slope, these differences are substantial and warrant an interaction term for continent. Looking at all the continents in aggregate hides the continent-specific information about LE trends.
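To back this up numerically, a minimal sketch fitting a separate slope per continent (reusing the data and continents variables from above):

In [ ]:
# per-continent slopes from simple one-variable fits; substantially
# different slopes justify a continent-year interaction term
for continent in continents:
    sub = data[data.continent == continent]
    m, b = np.polyfit(sub["year"], sub["lifeExp"], 1)
    print(continent, round(m, 3))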
Exercise 6: Fit a linear regression model for life expectancy including a term for an interaction between continent and year. Print out the model in a readable format, e.g., print the coefficients of the model (no need to plot). Hint: adding interaction terms is a form of feature engineering, like we discussed in class (think about, e.g., using (a subset of) polynomial features here).
In [16]:
import statsmodels.api as sm

import statsmodels.formula.api as smf

In [17]:
mod = smf.ols(formula='lifeExp ~ year * continent', data=data)

reg = mod.fit()

print(reg.summary())

OLS Regression Results

==============================================================================

Dep. Variable: lifeExp R-squared: 0.693

Model: OLS Adj. R-squared: 0.691

Method: Least Squares F-statistic: 424.3

Date: Mon, 22 Nov 2021 Prob (F-statistic): 0.00

Time: 04:49:48 Log-Likelihood: -5771.9

No. Observations: 1704 AIC: 1.156e+04

Df Residuals: 1694 BIC: 1.162e+04

Df Model: 9

Covariance Type: nonrobust

==============================================================================================

coef std err t P>|t| [0.025 0.975]

----------------------------------------------------------------------------------------------

Intercept -524.2578 32.963 -15.904 0.000 -588.911 -459.605

continent[T.Americas] -138.8484 57.851 -2.400 0.016 -252.315 -25.382

continent[T.Asia] -312.6330 52.904 -5.909 0.000 -416.396 -208.870

continent[T.Europe] 156.8469 54.498 2.878 0.004 49.957 263.737

continent[T.Oceania] 182.3499 171.283 1.065 0.287 -153.599 518.298

year 0.2895 0.017 17.387 0.000 0.257 0.322

year:continent[T.Americas] 0.0781 0.029 2.673 0.008 0.021 0.135

year:continent[T.Asia] 0.1636 0.027 6.121 0.000 0.111 0.216

year:continent[T.Europe] -0.0676 0.028 -2.455 0.014 -0.122 -0.014

year:continent[T.Oceania] -0.0793 0.087 -0.916 0.360 -0.249 0.090

==============================================================================

Omnibus: 27.121 Durbin-Watson: 0.242

Prob(Omnibus): 0.000 Jarque-Bera (JB): 44.106

Skew: -0.121 Prob(JB): 2.65e-10

Kurtosis: 3.750 Cond. No. 2.09e+06

==============================================================================

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 2.09e+06. This might indicate that there are

strong multicollinearity or other numerical problems.

Question 11: Are all parameters in the model significantly (in the p-value sense) different from zero? If not, which are not significantly different from zero? Other libraries (statsmodels or patsy) may help you solve this problem.
No. At a significance level of 0.1, continent[T.Oceania] and year:continent[T.Oceania] have p-values of 0.287 and 0.360 respectively, so they are not significantly different from zero. The rest of the parameters have p-values below 0.1 and are significantly different from zero.
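This can be extracted programmatically from the fitted statsmodels results object reg; a one-line sketch:

In [ ]:
# parameters whose p-values exceed the 0.1 significance level
print(reg.pvalues[reg.pvalues > 0.1])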
Question 12: On average, by how much does life expectancy increase each year for each continent? (Provide code to answer this question by extracting relevant estimates from model fit)
In [18]:
print(reg.params)

Intercept -524.257846

continent[T.Americas] -138.848447

continent[T.Asia] -312.633049

continent[T.Europe] 156.846852

continent[T.Oceania] 182.349883

year 0.289529

year:continent[T.Americas] 0.078122

year:continent[T.Asia] 0.163593

year:continent[T.Europe] -0.067597

year:continent[T.Oceania] -0.079257

dtype: float64

Africa is the reference level here, so the year coefficient is Africa's per-year increase, and the interaction coefficients are offsets from that baseline. On average, LE increases per year by 0.290 for Africa, 0.290 + 0.078 = 0.368 for the Americas, 0.290 + 0.164 = 0.453 for Asia, 0.290 - 0.068 = 0.222 for Europe, and 0.290 - 0.079 = 0.210 for Oceania in this model.
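A sketch of extracting these estimates directly from reg.params (the coefficient names follow statsmodels' treatment coding, as seen in the summary above):

In [ ]:
# Africa is the reference level; interaction coefficients are offsets from it
base = reg.params['year']
print('Africa', round(base, 3))
for c in ['Americas', 'Asia', 'Europe', 'Oceania']:
    print(c, round(base + reg.params['year:continent[T.' + c + ']'], 3))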
Exercise 7: Make a residuals vs. year violin plot for the interaction model.
Comment on how well it matches assumptions of the linear regression model.
In [19]:
# First, use dataframe column creation to create a new column predLifeExpInt holding the interaction model's
# predictions, and an associated residual column by subtracting the predicted from the measured values
predLifeExpInt = reg.predict(data)

data["predLifeExpInt"] = predLifeExpInt

data["residualInt"] = data["lifeExp"] - data["predLifeExpInt"]

data.head(100)

Out[19]: country continent year lifeExp pop gdpPercap predLifeExp residual predLifeExpInt residualInt
0 Afghanistan Asia 1952 28.801 8425333 779.445314 50.512084 -21.711084 47.604037 -18.803037
1 Afghanistan Asia 1957 30.332 9240934 820.853030 52.141603 -21.809603 49.869649 -19.537649
2 Afghanistan Asia 1962 31.997 10267083 853.100710 53.771122 -21.774122 52.135261 -20.138261
3 Afghanistan Asia 1967 34.020 11537966 836.197138 55.400642 -21.380642 54.400873 -20.380873
4 Afghanistan Asia 1972 36.088 13079460 739.981106 57.030161 -20.942161 56.666485 -20.578485
... ... ... ... ... ... ... ... ... ... ...
95 Bahrain Asia 2007 75.635 708573 29796.048340 68.436795 7.198205 72.525769 3.109231
96 Bangladesh Asia 1952 37.484 46886859 684.244172 50.512084 -13.028084 47.604037 -10.120037
97 Bangladesh Asia 1957 39.348 51365468 661.637458 52.141603 -12.793603 49.869649 -10.521649
98 Bangladesh Asia 1962 41.216 56839289 686.341554 53.771122 -12.555122 52.135261 -10.919261
99 Bangladesh Asia 1967 43.453 62821884 721.186086 55.400642 -11.947642 54.400873 -10.947873
100 rows × 10 columns
In [20]:
# As in Exercise 3, we'll make a residual vs time for the interaction model

# for each year in the data frame we'll make a separate list of residual values and add them

# to the residual_collections list of lists. We will then make a violin plot out of them.

residual_collections = []

for year in years:

    data_year = data[data.year == year]

    # now that data_year has rows where year is year, we will add to residuals all residual values.

    residuals = data_year['residualInt'].tolist()

    residual_collections.append(residuals)

# now that residual_collections has the data needed to make a violin plot, we will do that

# violin_plot = ax.violinplot(collections)

fig, ax = plt.subplots()

ax.violinplot(residual_collections,years,widths=4,showmeans=True)

ax.set_xlabel("Year")

ax.set_ylabel("Residual for Interaction Model")

ax.set_title("Violin Plot of Residuals of Interaction Model across time")

Out[20]: Text(0.5, 1.0, 'Violin Plot of Residuals of Interaction Model across time')

It matches the assumptions of the linear regression model in that the residuals are roughly centered around 0 for each year. This makes sense because we expect most residuals to be close to 0, with values much greater or less than 0 occurring less frequently. The residuals appear roughly normally distributed within each year. The spread still differs somewhat across years, though, so the constant-variance (homoscedasticity) assumption is only approximately satisfied.
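A quick numeric check of the centering claim (a sketch using the residualInt column computed above):

In [ ]:
# per-year mean (should sit near 0) and spread of the interaction-model residuals
print(data.groupby('year')['residualInt'].agg(['mean', 'std']))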
Part 2: Classification
We'll look at the Iris dataset from scikit-learn https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset. From the website,
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information: sepal length in cm , sepal width in cm , petal length in cm , petal width in cm
class: Iris-Setosa , Iris-Versicolour , Iris-Virginica
So the information about the Iris flowers' petals and sepals (length and width of each) are given as attributes with the classifications.
In [21]:
import seaborn as sns

from sklearn import tree

from sklearn.datasets import load_iris

X,y = load_iris(return_X_y = True)

data = load_iris()

# feature_names returns the 4 attributes as a list which we'll use as column names in the dataframe

cols = data.feature_names

# Make a dataframe out of the attributes and we'll then add in the classification as a new column

df = pd.DataFrame(X, columns = cols)

df["classif"] = y

df

Out[21]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
150 rows × 5 columns
In [22]:
df_0 = df[df.classif == 0]

df_0.describe().head(3)

Out[22]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
count 50.00000 50.000000 50.000000 50.000000 50.0
mean 5.00600 3.428000 1.462000 0.246000 0.0
std 0.35249 0.379064 0.173664 0.105386 0.0

In [23]:
df_1 = df[df.classif == 1]

df_1.describe().head(3)

Out[23]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
count 50.000000 50.000000 50.000000 50.000000 50.0
mean 5.936000 2.770000 4.260000 1.326000 1.0
std 0.516171 0.313798 0.469911 0.197753 0.0

In [24]:
df_2 = df[df.classif == 2]

df_2.describe().head(3)

Out[24]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
count 50.00000 50.000000 50.000000 50.00000 50.0
mean 6.58800 2.974000 5.552000 2.02600 2.0
std 0.63588 0.322497 0.551895 0.27465 0.0
We see that the means of the last two attributes (petal length and petal width) are quite dissimilar across the three classes relative to their standard deviations. From a quick glance at the means and standard deviations, most of the data (falling within 1 or 2 standard deviations of the mean) will not overlap between classes for these two attributes. So this looks like a decent fit for Linear Discriminant Analysis (LDA), our first algorithm.
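One way to eyeball this separation (a minimal sketch; seaborn was imported above as sns):

In [ ]:
# pairwise scatter plots colored by class; the petal features
# separate the three classes most cleanly
sns.pairplot(df, hue='classif')
plt.show()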
Method 1: Linear Discriminant Analysis (LDA)
We'll use holdout validation. First, we will split the data into training and testing sets, stratified by class: using the df_0, df_1, df_2 dataframes (which filter by classification), we randomly shuffle each, take the first two-thirds for training and the remaining third for testing, and then join the pieces into one training set and one test set.
In [25]:
# We'll split by first shuffling the rows for each dataframe associated with a class and then splitting at the two-thirds mark;
# then we'll add to the train and test dfs below (the dfs of which we'll concatenate at the end)

train_dfs = []

test_dfs = []

dfs = [df_0, df_1, df_2]

# for each of the dfs associated with a class, we'll shuffle the rows, split at the two-thirds mark, put the
# first part in the training set and the rest in the test set, and then concatenate the dfs in each set at the end
for df in dfs:

    df_shuffled = df.sample(frac = 1, random_state = 50)

    mid = int(len(df) * (2/3))

    df_train = df_shuffled.iloc[: mid]

    df_test = df_shuffled.iloc[mid :]

    train_dfs.append(df_train)

    test_dfs.append(df_test)

train_df = pd.concat(train_dfs)

test_df = pd.concat(test_dfs)

# pd.set_option('display.max_rows', None)

train_df

Out[25]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
34 4.9 3.1 1.5 0.2 0
36 5.5 3.5 1.3 0.2 0
1 4.9 3.0 1.4 0.2 0
38 4.4 3.0 1.3 0.2 0
8 4.4 2.9 1.4 0.2 0
... ... ... ... ... ...
140 6.7 3.1 5.6 2.4 2
144 6.7 3.3 5.7 2.5 2
126 6.2 2.8 4.8 1.8 2
127 6.1 3.0 4.9 1.8 2
128 6.4 2.8 5.6 2.1 2
99 rows × 5 columns
In [26]:
test_df

Out[26]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif
14 5.8 4.0 1.2 0.2 0
43 5.0 3.5 1.6 0.6 0
7 5.0 3.4 1.5 0.2 0
31 5.4 3.4 1.5 0.4 0
2 4.7 3.2 1.3 0.2 0
5 5.4 3.9 1.7 0.4 0
22 4.6 3.6 1.0 0.2 0
42 4.4 3.2 1.3 0.2 0
37 4.9 3.6 1.4 0.1 0
6 4.6 3.4 1.4 0.3 0
4 5.0 3.6 1.4 0.2 0
30 4.8 3.1 1.6 0.2 0
33 5.5 4.2 1.4 0.2 0
45 4.8 3.0 1.4 0.3 0
11 4.8 3.4 1.6 0.2 0
32 5.2 4.1 1.5 0.1 0
48 5.3 3.7 1.5 0.2 0
64 5.6 2.9 3.6 1.3 1
93 5.0 2.3 3.3 1.0 1
57 4.9 2.4 3.3 1.0 1
81 5.5 2.4 3.7 1.0 1
52 6.9 3.1 4.9 1.5 1
55 5.7 2.8 4.5 1.3 1
72 6.3 2.5 4.9 1.5 1
92 5.8 2.6 4.0 1.2 1
87 6.3 2.3 4.4 1.3 1
56 6.3 3.3 4.7 1.6 1
54 6.5 2.8 4.6 1.5 1
80 5.5 2.4 3.8 1.1 1
83 6.0 2.7 5.1 1.6 1
95 5.7 3.0 4.2 1.2 1
61 5.9 3.0 4.2 1.5 1
82 5.8 2.7 3.9 1.2 1
98 5.1 2.5 3.0 1.1 1
114 5.8 2.8 5.1 2.4 2
143 6.8 3.2 5.9 2.3 2
107 7.3 2.9 6.3 1.8 2
131 7.9 3.8 6.4 2.0 2
102 7.1 3.0 5.9 2.1 2
105 7.6 3.0 6.6 2.1 2
122 7.7 2.8 6.7 2.0 2
142 5.8 2.7 5.1 1.9 2
137 6.4 3.1 5.5 1.8 2
106 4.9 2.5 4.5 1.7 2
104 6.5 3.0 5.8 2.2 2
130 7.4 2.8 6.1 1.9 2
133 6.3 2.8 5.1 1.5 2
145 6.7 3.0 5.2 2.3 2
111 6.4 2.7 5.3 1.9 2
132 6.4 2.8 5.6 2.2 2
148 6.2 3.4 5.4 2.3 2
Now, we will fit the LDA on the training dataframe.
In [27]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# separate the train df into attributes and classification and convert to numpy to use LDA

X_df = train_df.drop(['classif'], axis=1)

X = X_df.to_numpy()

y = np.array(train_df["classif"].tolist())

clf_lda = LinearDiscriminantAnalysis()

clf_lda.fit(X, y)

Out[27]: LinearDiscriminantAnalysis()

Now that we have a fit, we'll predict values using our test_df
In [28]:
test_attributes = test_df.drop(['classif'], axis=1).to_numpy().tolist()

correct_vals = test_df["classif"].tolist()

predictions = []

# for each set of attributes, make a prediction and add it to list of predictions which is added to the test df
for attributes in test_attributes :

    [prediction] = clf_lda.predict([attributes])

    predictions.append(prediction)

test_df["prediction_LDA"] = predictions

With the predictions and the classifications, we can look at the accuracy of the classifier. We'll define accuracy as

accuracy = (correct predictions) / (total predictions)
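For reference, scikit-learn provides an equivalent helper; a one-line sketch that should match the manual count below:

In [ ]:
from sklearn.metrics import accuracy_score

# fraction of predictions that match the true labels
print(accuracy_score(test_df["classif"], test_df["prediction_LDA"]))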

In [29]:
predictions = test_df["prediction_LDA"].tolist()

classifications = test_df["classif"].tolist()

correctness = []

num_correct = 0

# for each pair of measured class and the predicted class, check if they're equal and make a list of true/false
# based on correctness to be added to the df

for i in range(len(test_df)):

    check = (classifications[i] == predictions[i])

    correctness.append(check)

    if (check == True):

        num_correct += 1

test_df["correctness_LDA"] = correctness

accuracy = num_correct / len(predictions)

In [30]:
print("Accuracy of LDA is " + str(accuracy))

test_df

Accuracy of LDA is 0.9607843137254902

Out[30]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif prediction_LDA correctness_LDA
14 5.8 4.0 1.2 0.2 0 0 True
43 5.0 3.5 1.6 0.6 0 0 True
7 5.0 3.4 1.5 0.2 0 0 True
31 5.4 3.4 1.5 0.4 0 0 True
2 4.7 3.2 1.3 0.2 0 0 True
5 5.4 3.9 1.7 0.4 0 0 True
22 4.6 3.6 1.0 0.2 0 0 True
42 4.4 3.2 1.3 0.2 0 0 True
37 4.9 3.6 1.4 0.1 0 0 True
6 4.6 3.4 1.4 0.3 0 0 True
4 5.0 3.6 1.4 0.2 0 0 True
30 4.8 3.1 1.6 0.2 0 0 True
33 5.5 4.2 1.4 0.2 0 0 True
45 4.8 3.0 1.4 0.3 0 0 True
11 4.8 3.4 1.6 0.2 0 0 True
32 5.2 4.1 1.5 0.1 0 0 True
48 5.3 3.7 1.5 0.2 0 0 True
64 5.6 2.9 3.6 1.3 1 1 True
93 5.0 2.3 3.3 1.0 1 1 True
57 4.9 2.4 3.3 1.0 1 1 True
81 5.5 2.4 3.7 1.0 1 1 True
52 6.9 3.1 4.9 1.5 1 1 True
55 5.7 2.8 4.5 1.3 1 1 True
72 6.3 2.5 4.9 1.5 1 1 True
92 5.8 2.6 4.0 1.2 1 1 True
87 6.3 2.3 4.4 1.3 1 1 True
56 6.3 3.3 4.7 1.6 1 1 True
54 6.5 2.8 4.6 1.5 1 1 True
80 5.5 2.4 3.8 1.1 1 1 True
83 6.0 2.7 5.1 1.6 1 2 False
95 5.7 3.0 4.2 1.2 1 1 True
61 5.9 3.0 4.2 1.5 1 1 True
82 5.8 2.7 3.9 1.2 1 1 True
98 5.1 2.5 3.0 1.1 1 1 True
114 5.8 2.8 5.1 2.4 2 2 True
143 6.8 3.2 5.9 2.3 2 2 True
107 7.3 2.9 6.3 1.8 2 2 True
131 7.9 3.8 6.4 2.0 2 2 True
102 7.1 3.0 5.9 2.1 2 2 True
105 7.6 3.0 6.6 2.1 2 2 True
122 7.7 2.8 6.7 2.0 2 2 True
142 5.8 2.7 5.1 1.9 2 2 True
137 6.4 3.1 5.5 1.8 2 2 True
106 4.9 2.5 4.5 1.7 2 2 True
104 6.5 3.0 5.8 2.2 2 2 True
130 7.4 2.8 6.1 1.9 2 2 True
133 6.3 2.8 5.1 1.5 2 1 False
145 6.7 3.0 5.2 2.3 2 2 True
111 6.4 2.7 5.3 1.9 2 2 True
132 6.4 2.8 5.6 2.2 2 2 True
148 6.2 3.4 5.4 2.3 2 2 True
The accuracy for the LDA is about 0.96.
Method 2: Decision Trees
We will use the same data for a fair comparison, so we'll reuse the 4 attribute columns from the split we made in the LDA section above. The default hyperparameters will be used.
In [31]:
from sklearn import tree

clf_tree = tree.DecisionTreeClassifier()

clf_tree = clf_tree.fit(X, y)

# similar to the LDA, we'll get the predictions and then add them to the dataframe and check for correctness

predictions_tree = []

for attributes in test_attributes :

    [prediction] = clf_tree.predict([attributes])

    predictions_tree.append(prediction)

test_df["prediction_tree"] = predictions_tree

In [32]:
correctness_tree = []

num_correct_tree = 0

# like for LDA, we'll compare each correct value in the test set to the predicted value and get the accuracy

for i in range(len(test_df)):

    check = (classifications[i] == predictions_tree[i])

    correctness_tree.append(check)

    if (check == True):

        num_correct_tree += 1

test_df["correctness_tree"] = correctness_tree
accuracy_tree = num_correct_tree / len(predictions_tree)

print("Accuracy of LDA is " + str(accuracy))

print("Accuracy of decision trees is " + str(accuracy_tree))

test_df

Accuracy of LDA is 0.9607843137254902

Accuracy of decision trees is 0.9215686274509803

Out[32]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) classif prediction_LDA correctness_LDA prediction_tree correctness_tree
14 5.8 4.0 1.2 0.2 0 0 True 0 True
43 5.0 3.5 1.6 0.6 0 0 True 0 True
7 5.0 3.4 1.5 0.2 0 0 True 0 True
31 5.4 3.4 1.5 0.4 0 0 True 0 True
2 4.7 3.2 1.3 0.2 0 0 True 0 True
5 5.4 3.9 1.7 0.4 0 0 True 0 True
22 4.6 3.6 1.0 0.2 0 0 True 0 True
42 4.4 3.2 1.3 0.2 0 0 True 0 True
37 4.9 3.6 1.4 0.1 0 0 True 0 True
6 4.6 3.4 1.4 0.3 0 0 True 0 True
4 5.0 3.6 1.4 0.2 0 0 True 0 True
30 4.8 3.1 1.6 0.2 0 0 True 0 True
33 5.5 4.2 1.4 0.2 0 0 True 0 True
45 4.8 3.0 1.4 0.3 0 0 True 0 True
11 4.8 3.4 1.6 0.2 0 0 True 0 True
32 5.2 4.1 1.5 0.1 0 0 True 0 True
48 5.3 3.7 1.5 0.2 0 0 True 0 True
64 5.6 2.9 3.6 1.3 1 1 True 1 True
93 5.0 2.3 3.3 1.0 1 1 True 1 True
57 4.9 2.4 3.3 1.0 1 1 True 1 True
81 5.5 2.4 3.7 1.0 1 1 True 1 True
52 6.9 3.1 4.9 1.5 1 1 True 2 False
55 5.7 2.8 4.5 1.3 1 1 True 1 True
72 6.3 2.5 4.9 1.5 1 1 True 2 False
92 5.8 2.6 4.0 1.2 1 1 True 1 True
87 6.3 2.3 4.4 1.3 1 1 True 1 True
56 6.3 3.3 4.7 1.6 1 1 True 1 True
54 6.5 2.8 4.6 1.5 1 1 True 1 True
80 5.5 2.4 3.8 1.1 1 1 True 1 True
83 6.0 2.7 5.1 1.6 1 2 False 2 False
95 5.7 3.0 4.2 1.2 1 1 True 1 True
61 5.9 3.0 4.2 1.5 1 1 True 1 True
82 5.8 2.7 3.9 1.2 1 1 True 1 True
98 5.1 2.5 3.0 1.1 1 1 True 1 True
114 5.8 2.8 5.1 2.4 2 2 True 2 True
143 6.8 3.2 5.9 2.3 2 2 True 2 True
107 7.3 2.9 6.3 1.8 2 2 True 2 True
131 7.9 3.8 6.4 2.0 2 2 True 2 True
102 7.1 3.0 5.9 2.1 2 2 True 2 True
105 7.6 3.0 6.6 2.1 2 2 True 2 True
122 7.7 2.8 6.7 2.0 2 2 True 2 True
142 5.8 2.7 5.1 1.9 2 2 True 2 True
137 6.4 3.1 5.5 1.8 2 2 True 2 True
106 4.9 2.5 4.5 1.7 2 2 True 1 False
104 6.5 3.0 5.8 2.2 2 2 True 2 True
130 7.4 2.8 6.1 1.9 2 2 True 2 True
133 6.3 2.8 5.1 1.5 2 1 False 2 True
145 6.7 3.0 5.2 2.3 2 2 True 2 True
111 6.4 2.7 5.3 1.9 2 2 True 2 True
132 6.4 2.8 5.6 2.2 2 2 True 2 True
148 6.2 3.4 5.4 2.3 2 2 True 2 True
The decision tree method has an accuracy of about 0.92, as opposed to the LDA's 0.96.
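To see which feature thresholds the fitted tree actually uses, a sketch with scikit-learn's plot_tree (using clf_tree and the cols feature names from above):

In [ ]:
# draw the fitted decision tree with feature names at each split
fig = plt.figure(figsize=(10, 6))
tree.plot_tree(clf_tree, feature_names=cols, filled=True)
plt.show()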
We can look at some different splits of the data (for training and testing) instead of the 2/3 train and 1/3 test used above. We'll look at whether LDA and decision trees give different results across the following values for the proportion of training data to total data. Overfitting/underfitting may affect the two methods differently.
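As an aside, the manual per-class split could also be done with scikit-learn's stratified splitter; a sketch for a single split fraction:

In [ ]:
from sklearn.model_selection import train_test_split

# stratify keeps the three classes balanced in both halves,
# matching the per-class split done manually above
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['classif'], axis=1), df['classif'],
    train_size=0.66, stratify=df['classif'], random_state=50)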
In [33]:
splits = (np.arange(5, 95, 1))/100

splits

Out[33]: array([0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11, 0.12, 0.13, 0.14, 0.15,

0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22, 0.23, 0.24, 0.25, 0.26,

0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37,

0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48,

0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59,

0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ,

0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 , 0.81,

0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91, 0.92,

0.93, 0.94])
We'll make a function that returns the test_df at the end of doing both methods and also the accuracies of each (it includes all the
work done above). The function will take in the dataframe of the loaded data and the split used.
In [34]:
X,y = load_iris(return_X_y = True)

data = load_iris()

# feature_names returns the 4 attributes as a list which we'll use as column names in the dataframe

cols = data.feature_names

# Make a dataframe out of the attributes and we'll then add in the classification as a new column

df = pd.DataFrame(X, columns = cols)

df["classif"] = y

In [35]: def compare (split, df):

    # split dfs by class

    df_0 = df[df.classif == 0]

    df_1 = df[df.classif == 1]

    df_2 = df[df.classif == 2]

    train_dfs = []

    test_dfs = []

    dfs = [df_0, df_1, df_2]

    # for each of the dfs associated with a class, we'll shuffle the rows, split at the given fraction, put the
    # first part in the training set and the rest in the test set, and then concatenate the dfs in each set at the end
    for df in dfs:

        df_shuffled = df.sample(frac = 1, random_state = 50)

        mid = int(len(df) * split)

        df_train = df_shuffled.iloc[: mid]

        df_test = df_shuffled.iloc[mid :]

        train_dfs.append(df_train)

        test_dfs.append(df_test)

    train_df = pd.concat(train_dfs)

    test_df = pd.concat(test_dfs)

    # Do LDA

    X_df = train_df.drop(['classif'], axis=1)

    X = X_df.to_numpy()

    y = np.array(train_df["classif"].tolist())

    clf_lda = LinearDiscriminantAnalysis()

    clf_lda.fit(X, y)

    # Make Predictions

    test_attributes = test_df.drop(['classif'], axis=1).to_numpy().tolist()

    correct_vals = test_df["classif"].tolist()

    predictions = []

    for attributes in test_attributes :

        [prediction] = clf_lda.predict([attributes])

        predictions.append(prediction)

    test_df["prediction_LDA"] = predictions

    # Check for correctness

    classifications = test_df["classif"].tolist()

    correctness = []

    num_correct = 0

    for i in range(len(test_df)):

        check = (classifications[i] == predictions[i])

        correctness.append(check)

        if (check == True):

            num_correct += 1

    test_df["correctness_LDA"] = correctness

    accuracy = num_correct / len(predictions)

    # DECISION TREES

    clf_tree = tree.DecisionTreeClassifier()

    clf_tree = clf_tree.fit(X, y)

    # similar to the LDA, we'll get the predictions and then add them to the dataframe and check for correctness
    predictions_tree = []

    for attributes in test_attributes :

        [prediction] = clf_tree.predict([attributes])

        predictions_tree.append(prediction)

    test_df["prediction_tree"] = predictions_tree

    correctness_tree = []

    num_correct_tree = 0

    for i in range(len(test_df)):

        check = (classifications[i] == predictions_tree[i])

        correctness_tree.append(check)

        if (check == True):

            num_correct_tree += 1

    test_df["correctness_tree"] = correctness_tree

    accuracy_tree = num_correct_tree / len(predictions_tree)

    ret = [test_df , accuracy, accuracy_tree]

    return ret

In [36]:
new_dfs = []

accs = []

accs_tree = []

for split in splits:

    [new_df, acc, acc_tree] = compare(split, df)

    new_dfs.append(new_df)

    accs.append(acc)

    accs_tree.append(acc_tree)

Visualizing this neatly in a dataframe:


In [37]: X_comparison = []

# make an array of arrays (where it's an array of rows of a df) to be fed into a df

for i in range(len(accs)):

    triple = [splits[i], accs[i], accs_tree[i]]

    X_comparison.append(triple)

df_comparison = pd.DataFrame(X_comparison)

df_comparison.columns = ['split', 'accuracy_LDA', 'accuracy_tree']

pd.set_option('display.max_rows', None)

df_comparison

Out[37]: split accuracy_LDA accuracy_tree


0 0.05 0.923611 0.881944
1 0.06 0.950355 0.943262
2 0.07 0.950355 0.943262
3 0.08 0.963768 0.949275
4 0.09 0.963768 0.949275
5 0.10 0.985185 0.948148
6 0.11 0.985185 0.948148
7 0.12 0.977273 0.946970
8 0.13 0.977273 0.946970
9 0.14 0.976744 0.945736
10 0.15 0.976744 0.945736
11 0.16 0.984127 0.944444
12 0.17 0.984127 0.944444
13 0.18 0.967480 0.943089
14 0.19 0.967480 0.943089
15 0.20 0.966667 0.941667
16 0.21 0.966667 0.941667
17 0.22 0.982906 0.940171
18 0.23 0.982906 0.940171
19 0.24 0.973684 0.938596
20 0.25 0.973684 0.938596
21 0.26 0.972973 0.936937
22 0.27 0.972973 0.936937
23 0.28 0.972222 0.935185
24 0.29 0.972222 0.935185
25 0.30 0.971429 0.933333
26 0.31 0.971429 0.933333
27 0.32 0.970588 0.931373
28 0.33 0.970588 0.931373
29 0.34 0.969697 0.929293
30 0.35 0.969697 0.929293
31 0.36 0.968750 0.927083
32 0.37 0.968750 0.927083
33 0.38 0.967742 0.924731
34 0.39 0.967742 0.924731
35 0.40 0.955556 0.922222
36 0.41 0.955556 0.922222
37 0.42 0.954023 0.919540
38 0.43 0.954023 0.919540
39 0.44 0.964286 0.916667
40 0.45 0.964286 0.916667
41 0.46 0.962963 0.913580
42 0.47 0.962963 0.913580
43 0.48 0.961538 0.910256
44 0.49 0.961538 0.910256
45 0.50 0.960000 0.906667
46 0.51 0.960000 0.906667
47 0.52 0.958333 0.902778
48 0.53 0.958333 0.902778
49 0.54 0.956522 0.898551
50 0.55 0.956522 0.898551
51 0.56 0.969697 0.909091
52 0.57 0.969697 0.909091
53 0.58 0.969697 0.909091
54 0.59 0.968254 0.904762
55 0.60 0.966667 0.900000
56 0.61 0.966667 0.900000
57 0.62 0.964912 0.912281
58 0.63 0.964912 0.912281
59 0.64 0.962963 0.925926
60 0.65 0.962963 0.925926
61 0.66 0.960784 0.921569
62 0.67 0.960784 0.921569
63 0.68 0.958333 0.916667
64 0.69 0.958333 0.916667
65 0.70 0.955556 0.911111
66 0.71 0.955556 0.911111
67 0.72 0.952381 0.904762
68 0.73 0.952381 0.904762
69 0.74 0.948718 0.897436
70 0.75 0.948718 0.897436
71 0.76 0.944444 0.944444
72 0.77 0.944444 0.944444
73 0.78 0.939394 0.909091
74 0.79 0.939394 0.909091
75 0.80 0.933333 0.933333
76 0.81 0.933333 0.933333
77 0.82 0.925926 0.925926
78 0.83 0.925926 0.925926
79 0.84 0.916667 0.916667
80 0.85 0.916667 0.916667
81 0.86 0.904762 0.952381
82 0.87 0.904762 0.952381
83 0.88 0.888889 0.944444
84 0.89 0.888889 0.944444
85 0.90 0.866667 0.933333
86 0.91 0.866667 0.933333
87 0.92 1.000000 1.000000
88 0.93 1.000000 1.000000
89 0.94 1.000000 1.000000
Let us look at a scatter plot of how the split affects both methods.
In [38]:
# LDA

plt.figure(figsize=(8,4))

plt.scatter(splits, accs)

plt.title("Split effect on accuracy of LDA")

plt.grid()

plt.xlabel("Split")

plt.ylabel("LDA Accuracy")

plt.show()

In [39]:
# Trees

plt.figure(figsize=(8,4))

plt.scatter(splits, accs_tree)

plt.title("Split effect on accuracy of Decision Trees")

plt.grid()

plt.xlabel("Split (fraction of data that is training)")

plt.ylabel("Decision Tree Accuracy")

plt.show()

We see that for LDA, small training fractions (roughly 0.10-0.25 of the 150-observation set) give the most accurate predictions, with accuracy declining gradually as the training fraction grows. For trees, however, apart from the > 0.90 splits (where the test set is so small that the accuracy estimate is unreliable), it seems either a smaller (close to 0.2) or a fairly large (close to 0.8) split results in higher accuracy. An even split seems to give the least accurate results.
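A less split-sensitive comparison could use k-fold cross-validation instead of a single holdout; a minimal sketch on the full X, y arrays loaded above:

In [ ]:
from sklearn.model_selection import cross_val_score

# mean accuracy over 5 stratified folds for each classifier
print(cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean())
print(cross_val_score(tree.DecisionTreeClassifier(), X, y, cv=5).mean())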
