Submission Group 12 Exercise 5
1 Data Science I
1.1 Exercise 5: Data Visualization & Machine Learning Intro
<div style="position: absolute; top: -90px; right: 10px; padding: 5px; background-color: #ddd;">
<span style="font-weight: bold;">Overall Points: / 100</span>
</div>
Submission Deadline: June 12 2023, 07:00 UTC
University of Oldenburg, Summer 2023. Instructors: Maria Fernanda “MaFe” Davila Restrepo, Wolfram “Wolle” Wingerath
Submitted by: Uzun, Burak; Yalcin, Mehmet; Akalin, Alp
Genre               Total US Gross
Action              8,492,622,725
Comedy              6,252,564,583
Drama               4,998,404,600
Thriller/Suspense   2,148,224,436
Horror              1,359,616,191
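Totals like these can be reproduced with a groupby aggregation. Below is a minimal sketch on a made-up mini-frame; the column names 'Major Genre' and 'US Gross' are assumptions about the movies data, and the values are illustrative only:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the movies data;
# the column names 'Major Genre' and 'US Gross' are assumptions.
movies_df = pd.DataFrame({
    "Major Genre": ["Action", "Comedy", "Action", "Drama"],
    "US Gross": [100, 40, 60, 80],
})

# Sum the gross per genre and sort descending, as in the table above
gross_by_genre = (movies_df.groupby("Major Genre")["US Gross"]
                  .sum()
                  .sort_values(ascending=False))
print(gross_by_genre)
```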
Adventure films frequently have a broad appeal and can draw in a variety of viewers. Action,
adventure, and exploration are frequently present in these movies, which can captivate audiences
and create a rich cinematic experience.
Adventure movies offer a sense of escapism and give spectators the chance to participate virtually in exhilarating and thrilling adventures. Audiences can become fascinated and experience an adrenaline rush thanks to the fascinating plotlines, exotic environments, and high-stakes action scenes.
Difference between the movie with the maximum gross and the movie with the minimum gross: $531,122,065.
# Pandas Options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Range of the US Gross
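The range computation itself is not shown; here is a minimal sketch, with the frame and column names assumed and the illustrative gross values chosen so that the difference matches the figure stated above:

```python
import pandas as pd

# Illustrative stand-in values; 'US Gross' is an assumed column name
movies_df = pd.DataFrame({"US Gross": [10_000_000, 541_122_065, 250_000_000]})

# Range of the US Gross: maximum minus minimum
us_range = movies_df["US Gross"].max() - movies_df["US Gross"].min()
print(us_range)
```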
# The R-squared value is 0.337, which indicates that approximately 33.7% of the variation in the 'SALE PRICE' can be explained by the 'GROSS SQUARE FEET' variable.
# The coefficient for 'GROSS SQUARE FEET' is 296.1355, which indicates that, on average, for each additional square foot, the predicted 'SALE PRICE' increases by $296.1355.
# Both the coefficient for 'GROSS SQUARE FEET' and the intercept have relatively small standard errors, and their t-values are significantly different from zero (p < 0.05), indicating that they are statistically significant.
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 4.71e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
c) Analyze the 2012 Olympic data set. What can you say about the relationship
between a country’s population and the number of medals it wins? What can you say
about the relationship between the ratio of female and male counts and the GDP of
that country?
Solution:
The R-squared value is 0.138, which indicates that approximately 13.8% of the variation in the ‘Number of Medals’ is explained by the ‘2010 population’. This means that the model does not capture a significant amount of the variation in the number of medals. Our p-value is 0.107, which is higher than the conventional threshold of 0.05, suggesting that the relationship between the population and the number of medals may not be statistically significant. The coefficient for the 2010 population is 3.175e-08, which indicates that, on average, a one-unit increase in the population is associated with an increase of approximately 0.00000003175 in the number of medals.
The R-squared value is 0.094, which indicates that approximately 9.4% of the variation in the dependent variable (2011 GDP) is explained by the independent variable (Male-Female Ratio). The p-value is 0.190, suggesting that the relationship between the male-female athlete ratio and GDP may not be statistically significant. There may still be some relationship between GDP and the male-female ratio, since developed countries with high GDP tend to have a ratio at or close to 1. But according to our results, there is no significant relationship between a country’s GDP and its male/female athlete ratio at the 2012 London Olympics. GDP per capita might be a better variable for observing this relationship.
olympics_df.dtypes
'''
ISO country code object
Country name object
2011 GDP float64
2010 population int64
Female count int64
Male count int64
Gold medals int64
Silver medals int64
Bronze medals int64
dtype: object'''
olympics_df['2011 GDP'] = olympics_df['2011 GDP'].astype('int')
olympics_df.isna().sum() # 0 NA values
# Creating a new column to aggregate all the medals won, because I am going to regress it against population.
olympics_df['Number of Medals'] = olympics_df[['Gold medals', 'Silver medals', 'Bronze medals']].sum(axis=1)
olympics_df.head()
# Linear Regression
import statsmodels.api as sm
import matplotlib.pyplot as plt
X = olympics_df['2010 population']
y = olympics_df['Number of Medals']
X = sm.add_constant(X)
# Fitting the Model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
# Visualizing the Relationship
plt.scatter(olympics_df['2010 population'], olympics_df['Number of Medals'], color='blue')
plt.xlabel('2010 Population')
plt.ylabel('Total Medals')
plt.title('Relationship between Population and Total Medals')
plt.show()
# The R-squared value is 0.138, which indicates that approximately 13.8% of the variation in the 'Number of Medals' is explained by the '2010 population'. This means that the model does not capture a significant amount of the variation in the number of medals.
# Our p-value is 0.107, which is higher than the conventional threshold of 0.05, suggesting that the relationship between the population and the number of medals may not be statistically significant.
# The coefficient for the 2010 population is 3.175e-08. This indicates that, on average, a one-unit increase in the population is associated with an increase of approximately 0.00000003175 in the number of medals.
y = olympics_df['2011 GDP']
X = olympics_df['Male-Female Ratio']
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
# The R-squared value is 0.094, which indicates that approximately 9.4% of the variation in the dependent variable 2011 GDP is explained by the independent variable Male-Female Ratio.
# Maybe there is a relationship between the GDP and Male-Female Ratio, because we can see that developed countries with high GDP have a ratio at or close to 1. But according to our results there is no significant relationship at the 2012 London Olympics.
# I think GDP per capita might be a better variable to observe this relationship.
plt.scatter(olympics_df['Male-Female Ratio'], olympics_df['2011 GDP'])
plt.xlabel('Male-Female Ratio')
plt.ylabel('2011 GDP')
plt.title('Relationship between Male-Female Ratio and 2011 GDP')
plt.show()
                            OLS Regression Results
==============================================================================
Dep. Variable:       Number of Medals   R-squared:                       0.138
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     2.875
Date:                Sun, 11 Jun 2023   Prob (F-statistic):              0.107
Time:                        22:52:07   Log-Likelihood:                -96.257
No. Observations:                  20   AIC:                             196.5
Df Residuals:                      18   BIC:                             198.5
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              21.4487      7.809      2.747      0.013       5.042      37.855
2010 population  3.175e-08   1.87e-08      1.696      0.107   -7.59e-09    7.11e-08
==============================================================================
Omnibus:                        3.842   Durbin-Watson:                   1.357
Prob(Omnibus):                  0.146   Jarque-Bera (JB):                2.316
Skew:                           0.824   Prob(JB):                        0.314
Kurtosis:                       3.258   Cond. No.                     4.64e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 4.64e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: 2011 GDP R-squared: 0.094
Model: OLS Adj. R-squared: 0.043
Method: Least Squares F-statistic: 1.857
Date: Sun, 11 Jun 2023 Prob (F-statistic): 0.190
Time: 22:52:07 Log-Likelihood: -605.52
No. Observations: 20 AIC: 1215.
Df Residuals: 18 BIC: 1217.
Df Model: 1
Covariance Type: nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              3.647e+12   1.29e+12      2.831      0.011    9.41e+11    6.35e+12
Male-Female Ratio -8.002e+11   5.87e+11     -1.363      0.190   -2.03e+12    4.34e+11
==============================================================================
Omnibus: 26.230 Durbin-Watson: 0.613
Prob(Omnibus): 0.000 Jarque-Bera (JB): 42.571
Skew: 2.244 Prob(JB): 5.70e-10
Kurtosis: 8.563 Cond. No. 4.00
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
d) Analyze the GDP per capita data set. How do countries from Europe, Asia, and
Africa compare in the rates of growth in GDP? When have countries faced substantial
changes in GDP, and what historical events were likely most responsible for it?
Solution:
2.1 Substantial Changes and Causes
When a country experiences substantial growth or decline in its GDP per capita, there can be several reasons. The first and most likely one is technological change; in economics, we define technology as changes or innovations in production methods. That is why the Industrial Revolution is still considered the most important development in human history and civilization. We can see significant GDP per capita growth at the beginning of the 19th century. This is also the time when we escaped the Malthusian trap.
The second biggest reason is war. The world’s economies were significantly impacted by both World Wars. Due to the devastation, loss of infrastructure, and resource allocation for the war effort, the GDP of many nations fell during the war periods. However, as nations rebuilt and industrialized, the post-war era frequently saw considerable economic recovery and growth.
Other reasons that have negative impacts on GDP growth are global financial crises, colonization, and pandemics.
It can be said that after the Industrial Revolution, GDP per capita increased considerably in Europe, Africa, and Asia. But the difference between Europe and Africa in particular can be explained by the early-industrializing European countries dominating global trade through new means of production and colonial activities. Apart from this, since GDP per capita is normalized by population, the welfare of European countries can partly be explained by their relatively low populations.
[5]: gpercap = pd.read_excel('/content/gapdata.xlsx', sheet_name=1, index_col = 0)
gpercap.head()
gpercap.reset_index(inplace = True)
#Europe
# I am creating new datasets by choosing the countries by continent to make observations.
# Visualizing the GDP per capita values for each European country
plt.figure(figsize=(10, 6))
for country, i in grouped_europe_df:
    plt.plot(i['Year'], i['GDP per capita - with interpolations'], label=country)
plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita for European Countries')
plt.legend()
plt.show()
# Asian Countries
asia_df = gpercap[gpercap['Area'].isin(asian_countries)]
grouped_asia_df = asia_df.groupby('Area')
plt.figure(figsize=(10, 6))
for country, i in grouped_asia_df:
    plt.plot(i['Year'], i['GDP per capita - with interpolations'], label=country)
plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita for Asian Countries')
plt.legend()
plt.show()
# African Countries
africa_df = gpercap[gpercap['Area'].isin(african_countries)]
grouped_africa_df = africa_df.groupby('Area')
plt.figure(figsize=(10, 6))
for country, i in grouped_africa_df:
    plt.plot(i['Year'], i['GDP per capita - with interpolations'], label=country)
plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita for African Countries')
plt.legend()
plt.show()
2.) For one data set of your own choosing, answer the following basic questions:
a) Who constructed it, when, and why?
Solution:
It was created by Mattias Lindgren on 21/04/2014 (version 14) to gather all the historical GDP per capita data. The data is good for research purposes because it reaches back to the early 16th century for almost every country.
b) How big is it?
Solution:
It has 58112 rows and 16 columns which are: ‘Area’, ‘Year’, ‘GDP per capita - with interpolations’,
‘Type of data’, ‘Data period’, ‘Longer description’, ‘Growth linked to which year’, ‘If use region or
neighbor - what is used’, ‘Source’, ‘Source notes’, ‘Other notes’, ‘Adjustments of data’, ‘Regional
average multiplied with spread-ot factor’, ‘Other footnotes’, ‘Other footnotes.1’, ‘Other footnotes.2’]
(58112, 16) Index(['Area', 'Year', 'GDP per capita - with interpolations', 'Type
of data', 'Data period', 'Longer description', 'Growth linked to which year',
'If use region or neighbor - what is used', 'Source ', 'Source notes', 'Other
notes', 'Adjustments of data', 'Regional average multiplied with spread-ot
factor', 'Other footnotes', 'Other footnotes.1', 'Other footnotes.2'],
dtype='object')
c) Identify a few familiar or interpretable records.
Solution:
Substantial growth in United States GDP Per Cap during the World Wars.
d) Find out and describe what Tukey’s five number summary is and then provide one
for at least 3 different columns.
Solution:
The five-number summary, often known as Tukey’s summary statistics, is a technique for summarizing a dataset’s distribution using five values: the minimum, the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum. It gives a brief description of the data’s central tendency, spread, and shape.
print(summarystats)
75% 2538.269358 1957.0
max 119849.293354 2018.0
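A sketch of how such a five-number summary can be computed per column with pandas; the series shown here is illustrative, while on the real data this would be applied to three columns such as 'GDP per capita - with interpolations' and 'Year':

```python
import pandas as pd

def five_number_summary(series):
    """Tukey's five-number summary: min, Q1, median, Q3, max."""
    return series.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Illustrative series; on the real data this would be e.g.
# five_number_summary(gpercap['GDP per capita - with interpolations'])
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(five_number_summary(s))
```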
e) State at least one interesting or noteworthy thing that you learned from your data
set.
Solution:
The most interesting thing is how similar these countries were in terms of wealth before the 18th century.
are used to distinguish between the different cities where protests have occurred, and this makes it
easy to see which cities have had the most protests.
f-1) How can the graphic be improved? (“https://www.nbcnews.com/data-graphics/tyre-nichols-
protests-erupted-united-states-rcna67987”) The graphic could be improved by including a legend
that explains the different colors that are used. This would make it even easier to understand the
data that is being presented.
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on a city to see more information
about the protests that have occurred there. The graphic could be updated regularly to reflect the
latest data on the number of protests that have occurred. The graphic could be made into a poster
or flyer that could be distributed to raise awareness of the issue of police brutality.
a-2) The graphic does a good job of presenting the data. It is clear and easy to understand, and
it provides a good overview of the amount of smoke from Canadian wildfires that has covered
the United States in the past month. (https://www.nbcnews.com/data-graphics/canada-wildfire-
smoke-covered-us-month-rcna87998)
b-2) The presentation does not appear to be biased, either deliberately or accidentally. The
graphic simply shows the amount of smoke that has covered the United States, and it does
not make any claims about the cause of the wildfires or the impact of the smoke on air quality.
(https://www.nbcnews.com/data-graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
c-2) There is no chartjunk in the figure. The graphic is simple and uncluttered, and it only includes
information that is relevant to the data that is being presented. (https://www.nbcnews.com/data-
graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
d-2) The axes are labeled in a clear and informative way. The x-axis shows the date, and the
y-axis shows the amount of smoke in parts per million (ppm). (https://www.nbcnews.com/data-
graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
e-2) The color is used effectively in the graphic. The different colors are used to distinguish between
the different states that have been affected by smoke from Canadian wildfires, and this makes it easy
to see which states have been most affected. (https://www.nbcnews.com/data-graphics/canada-
wildfire-smoke-covered-us-month-rcna87998)
f-2) The graphic could be improved by including a legend that explains the different colors
that are used. This would make it even easier to understand the data that is being presented.
(https://www.nbcnews.com/data-graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on a state to see more information
about the amount of smoke that has covered that state. The graphic could be updated regularly
to reflect the latest data on the amount of smoke from Canadian wildfires that is covering the
United States. The graphic could be made into a poster or flyer that could be distributed to raise
awareness of the issue of air quality and wildfires.
a-3) Does it do a good job or a bad job of presenting the data? Why?
(https://www.nbcnews.com/data-graphics/absentee-homeowners-crowding-housing-market-
data-rcna69828) I think the graphic does a good job of presenting the data. It is clear and easy to
understand, and it provides a good overview of the share of homes sold to absentee owners in nine
major metropolitan areas in the United States. The graphic shows that the share of homes sold to
absentee owners has increased since 2020 in all nine areas.
b-3) Does the presentation appear to be biased, either deliberately or accidentally?
(https://www.nbcnews.com/data-graphics/absentee-homeowners-crowding-housing-market-data-
rcna69828)
I do not think the presentation appears to be biased, either deliberately or accidentally. The
graphic simply shows the data, and it does not make any claims about why the share of homes sold
to absentee owners has increased.
c-3) Is there chartjunk in the figure? Where? (https://www.nbcnews.com/data-graphics/absentee-
homeowners-crowding-housing-market-data-rcna69828)
I do not see any chartjunk in the figure. The graphic is simple and uncluttered, and it only includes
information that is relevant to the data that is being presented.
d-3) Are the axes labeled in a clear and informative way? (https://www.nbcnews.com/data-
graphics/absentee-homeowners-crowding-housing-market-data-rcna69828)
Yes, the axes are labeled in a clear and informative way. The x-axis shows the year, and the y-axis
shows the percentage of homes sold to absentee owners.
e-3) Is the color used effectively? (https://www.nbcnews.com/data-graphics/absentee-homeowners-
crowding-housing-market-data-rcna69828)
Yes, the color is used effectively in the graphic. The different colors are used to distinguish between
the different metropolitan areas, and this makes it easy to see which areas have had the highest
and lowest shares of homes sold to absentee owners.
f-3) How can the graphic be improved? (https://www.nbcnews.com/data-graphics/absentee-
homeowners-crowding-housing-market-data-rcna69828)
The graphic could be improved by including a legend that explains the different colors that are
used. This would make it even easier to understand the data that is being presented.
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on a metropolitan area to see more
information about the share of homes sold to absentee owners in that area. The graphic could be
updated regularly to reflect the latest data on the share of homes sold to absentee owners in the
nine metropolitan areas. The graphic could be made into a poster or flyer that could be distributed
to raise awareness of the issue of absentee ownership in the housing market.
a-4) Does it do a good job or a bad job of presenting the data? Why?
(https://www.nbcnews.com/data-graphics/turkey-earthquake-map-see-aftershocks-rcna69713)
I think the graphic does a good job of presenting the data. It is clear and easy to understand,
and it provides a good overview of the location and magnitude of the earthquakes and aftershocks
that have occurred in Turkey and Syria since February 3, 2023. The graphic shows that the
earthquakes have been concentrated in a region along the border between Turkey and Syria, and
that the aftershocks have been gradually decreasing in magnitude.
b-4) Does the presentation appear to be biased, either deliberately or accidentally?
(https://www.nbcnews.com/data-graphics/turkey-earthquake-map-see-aftershocks-rcna69713)
I do not think the presentation appears to be biased, either deliberately or accidentally. The graphic
simply shows the data, and it does not make any claims about the cause of the earthquakes or the
impact of the earthquakes and aftershocks on the people of Turkey and Syria.
c-4) Is there chartjunk in the figure? Where? (https://www.nbcnews.com/data-graphics/turkey-
earthquake-map-see-aftershocks-rcna69713)
I do not see any chartjunk in the figure. The graphic is simple and uncluttered, and it only includes
information that is relevant to the data that is being presented.
d-4) Are the axes labeled in a clear and informative way? (https://www.nbcnews.com/data-
graphics/turkey-earthquake-map-see-aftershocks-rcna69713)
Yes, the axes are labeled in a clear and informative way. The x-axis shows the date of the earthquake
or aftershock, and the y-axis shows the magnitude of the earthquake or aftershock.
e-4) Is the color used effectively? (https://www.nbcnews.com/data-graphics/turkey-earthquake-
map-see-aftershocks-rcna69713)
Yes, the color is used effectively in the graphic. The different colors are used to distinguish between
the different earthquakes and aftershocks, and this makes it easy to see which earthquakes and
aftershocks have occurred.
f-4) How can the graphic be improved? (https://www.nbcnews.com/data-graphics/turkey-
earthquake-map-see-aftershocks-rcna69713)
The graphic could be improved by including a legend that explains the different colors that are
used. This would make it even easier to understand the data that is being presented.
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on an earthquake or aftershock to
see more information about it. The graphic could be updated regularly to reflect the latest data
on the earthquakes and aftershocks that have occurred in Turkey and Syria. The graphic could be
made into a poster or flyer that could be distributed to raise awareness of the risk of earthquakes
and aftershocks in Turkey and Syria.
4.) Visit https://viz.wtf and find five laughably bad visualizations. Explain why they
are both bad and amusing.
Solution:
1-)The “Bar Chart Race” That Doesn’t Race (https://bigthink.com/strange-maps/bar-chart-
races/)
This visualization is supposed to be a bar chart race, which is a type of animation that shows how
data changes over time. However, this visualization doesn’t actually race. The bars just sit there,
and the data changes in a very slow and unexciting way.
2-)The Pie Chart That Doesn’t Add Up (https://www.addtwodigital.com/add-two-
blog/2021/2/14/rule-4-the-values-in-your-pie-chart-should-add-up-to-100)
This pie chart is supposed to show the distribution of a population by age group. However, the
percentages don’t add up to 100%. This is a basic mistake that any data visualization should avoid.
3-)The Line Chart That Goes Off the Rails (https://www.fusioncharts.com/line-charts)
This line chart is supposed to show the stock market over time. However, the line goes off the rails
at one point. This is a sign that the data is unreliable or that the visualization is not properly
designed.
4-)The Scatterplot That’s Not Scattered (https://chartio.com/learn/charts/what-is-a-scatter-
plot/)
This scatterplot is supposed to show the relationship between two variables. However, the points
are not scattered. This is a sign that the data is not normally distributed or that the visualization
is not properly designed.
5-)The Map That’s Not to Scale (https://www.independent.co.uk/news/science/world-
map-mercator-peters-gall-projection-boston-globe-us-schools-european-colonial-distortion-bias-
a7639101.html)
This map is supposed to show the size of different countries. However, the countries are not to
scale. This is a sign that the map is not properly designed or that the data is not reliable.
These are just a few examples of laughably bad visualizations. There are many more out there,
and they can be quite amusing. However, they are also a reminder that data visualization is a skill
that takes time and practice to master.
# Display the styled table
styled_df
c) A scatter plot.
Solution:
# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(countries, medals, s=50, c='blue', alpha=0.7)
plt.xticks(rotation=90)
plt.xlabel('Country')
plt.ylabel('Number of Medals')
plt.title('2012 London Olympics - Number of Medals by Country')
plt.grid(True)
plt.tight_layout()
plt.show()
d) A heatmap.
Solution:
columns = ['2011 GDP', '2010 population', 'Number of Medals']
heatmap_data = olympics_df[columns]
#Corr Matrix
correlation_matrix = heatmap_data.corr()
# Labels
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
plt.title('Medal Distribution by Country 2012 London Olympics')
plt.axis('equal') # to make pie circular
plt.show()
f) A histogram.
Solution:
plt.show()
6.) Find and tell a Story with data! To this end, first go to
https://uol.de/planung-entwicklung/akademisches-controlling/studium-und-lehre
and select one of the following data sets:
• Studienanfängerinnen / Studienanfängerinnen (Fallstatistik) nach Studiengang
• Fachstudiendauer / Übersicht über die Fachstudiendauer
Then explore the data and find a story to tell with it.
a) Define an audience and a goal.
(Example: A data viz that highlights a potential issue to the university council or one
that tries to win new students for a certain subject area.)
Solution:
4.1 One can choose their major by looking at this data: we have total graduate
numbers grouped by year and department. It is also a good predictor for
understanding whether a department is easy to finish or not.
[14]: df_ol = pd.read_excel('/content/Fachstudiendauer_2021_20220802_EV.xlsx', index_col=0)
df_ol = df_ol[1:]
df_ol.dropna(inplace = True)
df_ol.head()
plt.show()
5 Part 4: Machine Learning Intro
<div style="position: absolute; top: -45px; right: 10px; padding: 5px; background-color: #ddd;">
<span style=""> / 25</span>
</div>
7.) Give decision trees to represent the following Boolean functions:
a) A and B.
Solution:
A
/ \
/ \
/ \
B 0
/ \
/ \
1 0
b) A or (B and C).
Solution:
A
/ \
/ \
/ \
1 B
/ \
/ \
/ \
C 0
/ \
/ \
1 0
In this decision tree (left branch = True, right branch = False):
The root node tests A. When A is True, the left leaf returns 1 and no further evaluation is needed, since A alone satisfies the function. When A is False, the right child tests B: if B is False the function returns 0, and if B is True the result still depends on C, so the tree tests C, returning 1 when C is True and 0 when C is False. Please note that this decision tree assumes that A, B, and C are binary variables (taking values of 0 or 1).
c) (A and B) or (C and D)
Solution:
                 A
              /     \
            B         C
           / \       / \
          1   C     D   0
             / \   / \
            D   0 1   0
           / \
          1   0
In this decision tree (left branch = True, right branch = False):
The root node tests A. If A is True, we test B: when B is also True the function returns 1, since (A and B) holds. If B is False, the result depends entirely on (C and D), so the tree tests C and then D, returning 1 only when both are True and 0 otherwise. If A is False, (A and B) cannot hold, so again the result depends on (C and D): the right subtree tests C, returns 0 immediately when C is False, and otherwise tests D, returning 1 when D is True and 0 when it is False. Please note that this decision tree assumes that A, B, C, and D are binary variables (taking values of 0 or 1).
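Hand-drawn trees like these are easy to get wrong, so it is worth checking them against the target function over all input rows. A sketch for (A and B) or (C and D); the helper name `tree_c` is hypothetical:

```python
from itertools import product

def tree_c(a, b, c, d):
    # Follows the decision tree: test A, then B, falling back to C and D
    if a:
        if b:
            return True
        return c and d
    return c and d

# Compare the tree against the target function on all 16 input rows
for a, b, c, d in product([False, True], repeat=4):
    assert tree_c(a, b, c, d) == ((a and b) or (c and d))
print("tree matches (A and B) or (C and D) on all 16 rows")
```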
8.) Consider the following titanic dataset: https://www.kaggle.com/competitions/titanic/data
a) Load the test and training data sets. Briefly describe the dataset.
Solution:
PassengerId (Integer): Serial numbers that provide a unique id for each passenger
Survived (Boolean): Whether a passenger survived. 0 = Not survived, 1 = Survived
Pclass (Integer): The ticket class the passenger belongs to. 1 = 1st class, 2 = 2nd class, 3 = 3rd class
Name (String): The name of the passenger
Sex (String): The gender of the passenger. Male or Female
Age (Integer): The age of the passenger
SibSp (Integer): The number of siblings or spouses aboard the ship
Parch (Integer): The number of parents or children aboard the ship
Ticket (String): The ticket number of the passenger
Fare (Float): The fare paid for the ticket
Cabin (String): The cabin number of the passenger
Embarked (String): The port at which the passenger embarked. C = Cherbourg, Q = Queenstown, S = Southampton
From the column descriptions we can see that the data can be classified into categorical and numerical variables:
Categorical: Survived, Sex, Embarked, Pclass (ordinal)
Numerical: Age (continuous), Fare (continuous), SibSp (discrete), Parch (discrete)
def check_df(dataframe, head=5):
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(t_train)
1307 1307 NaN 3 Ware, Mr. Frederick male NaN 0 0 359309 8.050 NaN S
1308 1308 NaN 3 Peter, Master. Michael J male NaN 1 1 2668 22.358 NaN C
##################### NA #####################
index 0
Survived 418
Pclass 0
Name 0
Sex 0
Age 263
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 1014
Embarked 2
dtype: int64
##################### Quantiles #####################
0.000 0.050 0.500 0.950 0.990 1.000
index 0.000 65.400 654.000 1242.600 1294.920 1308.000
Survived 0.000 0.000 0.000 1.000 1.000 1.000
Pclass 1.000 1.000 3.000 3.000 3.000 3.000
Age 0.170 5.000 28.000 57.000 65.000 80.000
SibSp 0.000 0.000 0.000 2.000 5.000 8.000
Parch 0.000 0.000 0.000 2.000 4.000 9.000
Fare 0.000 7.225 14.454 133.650 262.375 512.329
<ipython-input-16-87edc5a594fb>:25: FutureWarning: The default value of
numeric_only in DataFrame.quantile is deprecated. In a future version, it will
default to False. Select only valid columns or specify the value of numeric_only
to silence this warning.
print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)
# num_cols
num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
num_cols = [col for col in num_cols if col not in num_but_cat]
print(f"Observations: {dataframe.shape[0]}")
print(f"Variables: {dataframe.shape[1]}")
print(f'cat_cols: {len(cat_cols)}')
print(f'num_cols: {len(num_cols)}')
print(f'cat_but_car: {len(cat_but_car)}')
print(f'num_but_cat: {len(num_but_cat)}')
return cat_cols, num_cols, cat_but_car
Observations: 1309
Variables: 12
cat_cols: 6
num_cols: 3
cat_but_car: 3
num_but_cat: 4
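Only the tail of the column-typing helper is shown above. The following is a common reconstruction of that grab_col_names pattern, not the original code; the thresholds cat_th=10 and car_th=20 are assumptions:

```python
import pandas as pd

def grab_col_names(dataframe, cat_th=10, car_th=20):
    # Object-typed columns are categorical candidates
    cat_cols = [c for c in dataframe.columns if dataframe[c].dtypes == "O"]
    # Numeric columns with few unique values behave like categoricals
    num_but_cat = [c for c in dataframe.columns
                   if dataframe[c].dtypes != "O" and dataframe[c].nunique() < cat_th]
    # High-cardinality object columns (e.g. Name, Ticket) are "cardinal"
    cat_but_car = [c for c in cat_cols if dataframe[c].nunique() > car_th]
    cat_cols = [c for c in cat_cols + num_but_cat if c not in cat_but_car]
    num_cols = [c for c in dataframe.columns if dataframe[c].dtypes != "O"]
    num_cols = [c for c in num_cols if c not in num_but_cat]
    return cat_cols, num_cols, cat_but_car

# Toy demo frame
toy = pd.DataFrame({"sex": ["m", "f"] * 10, "age": list(range(20)), "flag": [0, 1] * 10})
cat_cols, num_cols, cat_but_car = grab_col_names(toy)
print(cat_cols, num_cols, cat_but_car)
```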
        return True
    else:
        return False
index False
Age False
Fare False
5.1 The Titanic dataset has 1309 rows and 12 columns in total. We have 177
NA values in the ‘Age’ variable, 687 in ‘Cabin’, and 2 in ‘Embarked’. This dataset
is widely used in Kaggle classification competitions. Our main aim is to
predict who survives and who does not.
[ ]:
b) Train a random forest classifier to predict survival chances for Titanic passengers.
(Hint: You can use one of the tutorials/submissions as a starting point.)
Solution:
[ ]: # Modeling
# First we need to find the categorical variables with between 2 and 10 unique
# classes, so we can create dummy variables and incorporate them in the model.
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

ohe_cols = [col for col in t_train.columns if 10 >= t_train[col].nunique() >= 2]
df = one_hot_encoder(t_train, ohe_cols, drop_first=True)
df.head()
# We also need to drop variables that are not suitable for modeling:
# Name, Ticket, index (PassengerId) and Cabin (too many NA values).
df.drop(['index', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
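For reference, the drop_first behaviour the encoder relies on can be sketched on a toy column (names illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
# drop_first=True drops one level per variable (here the alphabetically
# first, 'C') to avoid perfectly collinear dummy columns.
print(pd.get_dummies(toy, columns=["Embarked"], drop_first=True))
```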
(truncated df.head() output: rows of the dataframe after one-hot encoding, with the new dummy columns)
rs = RobustScaler()
nums = df[['Age', 'Fare']]
# assign the scaled values back so the model actually uses them
df[['Age', 'Fare']] = rs.fit_transform(nums)
df[['Age', 'Fare']].values
[ ]: array([[-0.33333333, -0.30965392],
[ 0.55555556, 2.02307104],
[-0.11111111, -0.28506375],
…,
[ 0. , -0.29052823],
[ 0.61111111, 3.39344262],
[ 0.58333333, -0.30965392]])
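RobustScaler centers on the median and scales by the interquartile range, so extreme fares distort the scale far less than with StandardScaler. A minimal check of the formula on toy data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# RobustScaler computes (x - median) / (q75 - q25), column-wise.
x = np.array([[1.0], [2.0], [3.0], [100.0]])  # one extreme outlier
scaled = RobustScaler().fit_transform(x)

q25, q75 = np.percentile(x, [25, 75])
manual = (x - np.median(x)) / (q75 - q25)
print(np.allclose(scaled, manual))  # prints True
```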
[ ]: # the Model
# Splitting data into train and test
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# (split/fit reconstructed from the output below; assumes missing
# values were handled in the preprocessing steps above)
X = df.drop('Survived', axis=1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, rf_model.predict(X_test))}")
Accuracy: 0.7482517482517482
[ ]:
"""
Accuracy: 0.7955 (Logistic Regression)
Accuracy: 0.6892 (K-Nearest Neighbors)
Accuracy: 0.7913 (Decision Tree)
Accuracy: 0.7984 (Random Forest)
Accuracy: 0.6654 (Support Vector Machine)
Accuracy: 0.8193 (Gradient Boosting)
Accuracy: 0.7998 (XGBoost)
Accuracy: 0.8194 (LightGBM)
Accuracy: 0.8292 (CatBoost)
"""
# Accuracy values for each model without any tuning. The best one is CatBoost,
# which is expected, since it is built to handle categorical variables.
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-
regression
n_iter_i = _check_optimize_result(
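The ConvergenceWarning above offers two remedies for logistic regression with the lbfgs solver: raise max_iter, or scale the features first. A minimal sketch of both on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Remedy 1: raise the iteration cap.
clf = LogisticRegression(max_iter=5000).fit(X, y)

# Remedy 2 (usually preferable): scale features so lbfgs converges quickly.
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
print(clf.score(X, y), pipe.score(X, y))
```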
Accuracy: 0.7955 (Logistic Regression)
Accuracy: 0.6892 (K-Nearest Neighbors)
Accuracy: 0.7913 (Decision Tree)
Accuracy: 0.7984 (Random Forest)
Accuracy: 0.6654 (Support Vector Machine)
Accuracy: 0.8193 (Gradient Boosting)
Accuracy: 0.7998 (XGBoost)
Accuracy: 0.8194 (LightGBM)
Accuracy: 0.8292 (CatBoost)
[ ]: # Model Tuning
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# hyperparameter grids
# (gbm_params and catboost_params reconstructed to match the reported
# best parameters below; the original grids were not shown)
rf_params = {"max_depth": [5, 8, 15, None],
             "max_features": [5, 7, "auto"],
             "min_samples_split": [8, 15, 20],
             "n_estimators": [200, 500, 1000]}
gbm_params = {"learning_rate": [0.01, 0.1],
              "n_estimators": [500, 1000],
              "subsample": [0.5, 1.0]}
catboost_params = {"iterations": [200, 500],
                   "learning_rate": [0.01, 0.1],
                   "depth": [3, 6]}

# classifiers
classifiers = [("RF", RandomForestClassifier(), rf_params),
               ('GBM', GradientBoostingClassifier(), gbm_params),
               ('CatBoost', CatBoostClassifier(verbose=False), catboost_params)]

for name, classifier, params in classifiers:
    random_search = RandomizedSearchCV(classifier, params, cv=10, n_iter=10, n_jobs=-1)
    random_search.fit(X, y)
'max_features': 7, 'max_depth': 8}
Mean cross-validated score for RF: 0.8180946791862285
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:305:
UserWarning: The total space of parameters 8 is smaller than n_iter=10. Running
8 iterations. For exhaustive searches, use GridSearchCV.
warnings.warn(
Best parameters for CatBoost: {'learning_rate': 0.1, 'iterations': 200, 'depth':
6}
Mean cross-validated score for CatBoost: 0.831964006259781
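The UserWarning above fires when a grid has fewer parameter combinations than n_iter, in which case RandomizedSearchCV simply enumerates them all and GridSearchCV would be the natural choice. A minimal sketch on synthetic data (grid sizes illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=42)
params = {"max_depth": [5, 8], "n_estimators": [100, 200]}  # 4 combinations

# Setting n_iter to the grid size (or using GridSearchCV) avoids the warning.
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            params, n_iter=4, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```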
accuracy = np.mean(cross_val_score(classifier, X, y, cv=10, scoring="accuracy"))
print(f"Accuracy: {round(accuracy, 4)} ({name})")
random_search.fit(X, y)
best_models
########## RF ##########
Accuracy: 0.8027 (RF)
Accuracy (After): 0.8279 (RF)
RF best params: {'n_estimators': 200, 'min_samples_split': 8, 'max_features': 7,
'max_depth': 15}
[ ]: {'RF': RandomForestClassifier(max_depth=15, max_features=7, min_samples_split=8,
n_estimators=200),
'GBM': GradientBoostingClassifier(learning_rate=0.01, n_estimators=1000,
subsample=0.5),
'CatBoost': <catboost.core.CatBoostClassifier at 0x7ff1a51125c0>}
from sklearn.ensemble import VotingClassifier
# estimator list reconstructed from best_models above; only voting='hard'
# and the fit call were shown in the original cell
voting_clf = VotingClassifier(estimators=list(best_models.items()), voting='hard')
voting_clf.fit(X, y)
Accuracy: 0.8306
6 Finally: Submission
Save your notebook and submit it (as both notebook and PDF file). And please don’t forget to
…
- … choose a file name according to convention (see Exercise Sheet 1, but please add your group
name as a suffix like _group01) and to
- … include the execution output in your submission!