
Submission_Group_12_Exercise_5 (1)

June 12, 2023

1 Data Science I
1.1 Exercise 5: Data Visualization & Machine Learning Intro
Overall Points:       / 100
Submission Deadline: June 12 2023, 07:00 UTC
University of Oldenburg, Summer 2023
Instructors: Maria Fernanda “MaFe” Davila Restrepo, Wolfram “Wolle” Wingerath
Submitted by: Uzun, Burak; Yalcin, Mehmet; Akalin, Alp

2 Part 1: Exploratory Data Analysis


Points:       / 25
1.) Provide answers to the questions associated with the following data sets, available
at
http://www.data-manual.com/data.
a) Analyze the movie data set. What is the range of movie gross in the United States?
Which type of movies are most likely to succeed in the market? Comedy? PG-13?
Drama? Why?
Solution:
Worldwide gross by Major Genre (top six):

Adventure          11150747849
Action              8492622725
Comedy              6252564583
Drama               4998404600
Thriller/Suspense   2148224436
Horror              1359616191

Adventure films have broad appeal and can draw in a wide variety of viewers. Action, adventure, and
exploration are central to these movies, which captivates audiences and creates a rich cinematic
experience. Adventure movies offer a sense of escapism and give spectators the chance to take part,
virtually, in exhilarating and thrilling adventures. Audiences become fascinated and experience an
adrenaline rush because of the engaging plotlines, exotic settings, and high-stakes action scenes.
The difference between the movie with the maximum gross and the one with the minimum gross is $531,122,065.

[2]: import pandas as pd
import numpy as np

# Pandas display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

movie_df = pd.read_csv('/content/movies.csv', index_col=0)
movie_df.dropna(inplace=True)  # drop the rows that include NA values

movie_df['Worldwide Gross'] = movie_df['Worldwide Gross'].astype('int')
df1 = movie_df.groupby('Major Genre')['Worldwide Gross'].sum()
sorted_df = df1.sort_values(ascending=False)
sorted_df
"""
Major Genre
Adventure            11150747849
Action                8492622725
Comedy                6252564583
Drama                 4998404600
Thriller/Suspense     2148224436
Horror                1359616191
Romantic Comedy       1192544184
Musical                357789047
Western                 69791889
Black Comedy            67348218
Documentary             13136074
Name: Worldwide Gross, dtype: int64
"""

# Range of the US gross. Note: 'US Gross' is read as strings here, so
# min()/max() compare the values lexicographically rather than numerically.
gross_range = (movie_df['US Gross'].min(), movie_df['US Gross'].max())
print("Range of movie gross US:", gross_range)

Range of movie gross US: ('100289690', '97690976')
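
The tuple above compares the values as strings ('US Gross' was never cast to a numeric type), which is why a "minimum" of 100289690 appears next to a "maximum" of 97690976. A minimal sketch of the numeric range, assuming the same movies.csv and column name:

[ ]: # Hypothetical follow-up: coerce 'US Gross' to numbers before taking min/max,
# so the comparison is numeric rather than lexicographic.
us_gross = pd.to_numeric(movie_df['US Gross'], errors='coerce').dropna()
print("Numeric range of US gross:", (us_gross.min(), us_gross.max()))
print("Spread (max - min):", us_gross.max() - us_gross.min())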


b) Analyze the Manhattan rolling sales data set. Where in Manhattan is the most/least
expensive real estate located? What is the relationship between sales price and gross
square feet?
Solution:
The R-squared value is 0.337, which indicates that approximately 33.7% of the varia-
tion in the ‘SALE PRICE’ can be explained by the ‘GROSS SQUARE FEET’ variable.
Coefficient for ‘GROSS SQUARE FEET’ is 296.1355, which indicates that, on average,
for each additional square foot, the predicted ‘SALE PRICE’ increases by $296.1355.
Both the coefficient for ‘GROSS SQUARE FEET’ and the intercept have relatively
small standard errors, and their t-values are significantly different from zero (p <
0.05), indicating that they are statistically significant.
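
As a quick plausibility check, the fitted line can be evaluated by hand. A minimal sketch using the intercept (about 1.126e+06) and slope (about 296.1355) reported in the regression output below, for a hypothetical property of 2,000 gross square feet:

[ ]: # Coefficients as reported in the OLS summary below (rounded).
intercept = 1.126e6        # const
slope_per_sqft = 296.1355  # GROSS SQUARE FEET

# Hypothetical example: predicted sale price for 2,000 gross square feet.
gross_sqft = 2000
predicted_price = intercept + slope_per_sqft * gross_sqft
print(f"Predicted SALE PRICE for {gross_sqft} sq ft: ${predicted_price:,.0f}")  # about $1,718,000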

[3]: sales_df = pd.read_excel('/content/rollingsales_manhattan.xls', index_col=0)

sales_df = sales_df[3:].reset_index(drop=True)
sales_df.columns = sales_df.iloc[0]
sales_df = sales_df[1:]
sales_df['SALE PRICE'] = sales_df['SALE PRICE'].astype('int')
indexmax = sales_df['SALE PRICE'].idxmax()
indexmin = sales_df['SALE PRICE'].idxmin()

expensive_nh = sales_df.loc[indexmax, 'NEIGHBORHOOD']
expensive_nh
# The most expensive house is located in 'TRIBECA'.

cheapest_nh = sales_df.loc[indexmin, 'NEIGHBORHOOD']
cheapest_nh
# The cheapest real estate is located in 'ALPHABET CITY'.

import statsmodels.api as sm
import matplotlib.pyplot as plt

sales_df['GROSS SQUARE FEET'] = sales_df['GROSS SQUARE FEET'].astype('int')
sales_df = sales_df.dropna(subset=['SALE PRICE', 'GROSS SQUARE FEET'])

X = sales_df['GROSS SQUARE FEET']
y = sales_df['SALE PRICE']
X = sm.add_constant(X)  # Add a constant term to the predictor
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

# The R-squared value is 0.337, which indicates that approximately 33.7% of the
# variation in the 'SALE PRICE' can be explained by the 'GROSS SQUARE FEET' variable.
# The coefficient for 'GROSS SQUARE FEET' is 296.1355, which indicates that, on
# average, for each additional square foot, the predicted 'SALE PRICE' increases
# by $296.1355.
# Both the coefficient for 'GROSS SQUARE FEET' and the intercept have relatively
# small standard errors, and their t-values are significantly different from
# zero (p < 0.05), indicating that they are statistically significant.

# Let's visualize the relationship between these two variables
plt.scatter(sales_df['GROSS SQUARE FEET'], sales_df['SALE PRICE'])
plt.xlabel('GROSS SQUARE FEET')
plt.ylabel('SALE PRICE')
plt.title('Relationship between SALE PRICE and GROSS SQUARE FEET')
plt.show()

OLS Regression Results
==============================================================================
Dep. Variable:             SALE PRICE   R-squared:                       0.337
Model:                            OLS   Adj. R-squared:                  0.337
Method:                 Least Squares   F-statistic:                     9627.
Date:                Sun, 11 Jun 2023   Prob (F-statistic):               0.00
Time:                        22:52:07   Log-Likelihood:            -3.4433e+05
No. Observations:               18926   AIC:                         6.887e+05
Df Residuals:                   18924   BIC:                         6.887e+05
Df Model:                           1
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1.126e+06   1.41e+05      7.976      0.000    8.49e+05     1.4e+06
GROSS SQUARE FEET   296.1355      3.018     98.117      0.000     290.220     302.051
==============================================================================
Omnibus:                    45471.657   Durbin-Watson:                   1.812
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       1797454162.060
Skew:                          24.211   Prob(JB):                         0.00
Kurtosis:                    1511.974   Cond. No.                     4.71e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 4.71e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

c) Analyze the 2012 Olympic data set. What can you say about the relationship
between a country’s population and the number of medals it wins? What can you say
about the relationship between the ratio of female and male counts and the GDP of
that country?
Solution:
The R-squared value is 0.138, which indicates that approximately 13.8% of the variation
in the 'Number of Medals' is explained by the '2010 population'. This means that the model
does not capture much of the variation in the number of medals. Our p-value of 0.107 is
higher than the conventional threshold of 0.05, suggesting that the relationship between
population and the number of medals may not be statistically significant. The coefficient
for the 2010 population is 3.175e-08, which indicates that, on average, a one-person
increase in population is associated with an increase of approximately 0.00000003175 in
the number of medals.

The R-squared value is 0.094, which indicates that approximately 9.4% of the variation
in the dependent variable (2011 GDP) is explained by the independent variable
(Male-Female Ratio). The p-value is 0.190, suggesting that the relationship between the
male-female athlete ratio and GDP may not be statistically significant. There may still
be a relationship between GDP and the male-female ratio, since developed countries with
high GDP tend to have a ratio close to 1. But according to our results there is no
significant relationship between a country's GDP and its male/female athlete ratio at the
2012 London Olympics. GDP per capita might be a better variable to observe this relationship.
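
To put the population coefficient on a human scale, a small worked sketch using the intercept (about 21.4487) and slope (3.175e-08) from the regression output below: 3.175e-08 extra medals per person amounts to roughly 3.2 extra medals per 100 million inhabitants.

[ ]: # Coefficients as reported in the OLS summary below (rounded).
intercept = 21.4487
slope_per_person = 3.175e-08

# Hypothetical example: expected medal count for a country of 100 million people.
population = 100_000_000
expected_medals = intercept + slope_per_person * population
print(f"Expected medals for a population of {population:,}: {expected_medals:.1f}")  # about 24.6
# The small slope and the p-value of 0.107 both suggest population alone is a weak predictor.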

[4]: olympics_df = pd.read_csv('/content/olympics.csv', index_col=0)

olympics_df.head()
olympics_df.reset_index(inplace=True)  # Took the 'ISO country code' as a column.

olympics_df.dtypes
'''
ISO country code     object
Country name         object
2011 GDP            float64
2010 population       int64
Female count          int64
Male count            int64
Gold medals           int64
Silver medals         int64
Bronze medals         int64
dtype: object'''
olympics_df['2011 GDP'] = olympics_df['2011 GDP'].astype('int')
olympics_df.isna().sum()  # 0 NA values

# Creating a new column that aggregates all the medals each country won,
# because I am going to regress it against population.
olympics_df['Number of Medals'] = olympics_df['Gold medals'] + olympics_df['Silver medals'] + olympics_df['Bronze medals']

olympics_df.head()

# Linear regression
X = olympics_df['2010 population']
y = olympics_df['Number of Medals']
X = sm.add_constant(X)

# Fitting the model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

# Visualizing the relationship
plt.scatter(olympics_df['2010 population'], olympics_df['Number of Medals'], color='blue')
plt.xlabel('2010 Population')
plt.ylabel('Total Medals')
plt.title('Relationship between Population and Total Medals')
plt.show()

# The R-squared value is 0.138, which indicates that approximately 13.8% of the
# variation in the 'Number of Medals' is explained by the '2010 population'.
# This means that the model does not capture a significant amount of the
# variation in the number of medals.
# Our p-value is 0.107, which is higher than the conventional threshold of 0.05,
# suggesting that the relationship between the population and the number of
# medals may not be statistically significant.
# The coefficient for the 2010 population is 3.175e-08. This indicates that, on
# average, a one-unit increase in the population is associated with an increase
# of approximately 0.00000003175 in the number of medals.

# Male-Female count ratio

# Creating a new feature to obtain the ratio
olympics_df['Male-Female Ratio'] = olympics_df['Male count'] / olympics_df['Female count']

y = olympics_df['2011 GDP']
X = olympics_df['Male-Female Ratio']
X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

# The R-squared value is 0.094, which indicates that approximately 9.4% of the
# variation in the dependent variable 2011 GDP is explained by the independent
# variable Male-Female Ratio.
# The p-value is 0.190, suggesting that the relationship between the
# male-female athlete ratio and GDP may not be statistically significant.
# Maybe there is a relationship between GDP and the Male-Female Ratio, because
# developed countries with high GDP tend to have a ratio close to 1. But
# according to our results there is no significant relationship between a
# country's GDP and its Male/Female athlete ratio at the 2012 London Olympics.
# GDP per capita might be a better variable to observe this relationship.

plt.scatter(olympics_df['Male-Female Ratio'], olympics_df['2011 GDP'])
plt.xlabel('Male-Female Ratio')
plt.ylabel('2011 GDP')
plt.title('Relationship between Male-Female Ratio and 2011 GDP')
plt.show()

OLS Regression Results
==============================================================================
Dep. Variable:       Number of Medals   R-squared:                       0.138
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     2.875
Date:                Sun, 11 Jun 2023   Prob (F-statistic):              0.107
Time:                        22:52:07   Log-Likelihood:                -96.257
No. Observations:                  20   AIC:                             196.5
Df Residuals:                      18   BIC:                             198.5
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              21.4487      7.809      2.747      0.013       5.042      37.855
2010 population  3.175e-08   1.87e-08      1.696      0.107   -7.59e-09    7.11e-08
==============================================================================
Omnibus:                        3.842   Durbin-Watson:                   1.357
Prob(Omnibus):                  0.146   Jarque-Bera (JB):                2.316
Skew:                           0.824   Prob(JB):                        0.314
Kurtosis:                       3.258   Cond. No.                     4.64e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.64e+08. This might indicate that there are strong multicollinearity or other numerical problems.

OLS Regression Results
==============================================================================
Dep. Variable:               2011 GDP   R-squared:                       0.094
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     1.857
Date:                Sun, 11 Jun 2023   Prob (F-statistic):              0.190
Time:                        22:52:07   Log-Likelihood:                -605.52
No. Observations:                  20   AIC:                             1215.
Df Residuals:                      18   BIC:                             1217.
Df Model:                           1
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              3.647e+12   1.29e+12      2.831      0.011    9.41e+11    6.35e+12
Male-Female Ratio -8.002e+11   5.87e+11     -1.363      0.190   -2.03e+12    4.34e+11
==============================================================================
Omnibus:                       26.230   Durbin-Watson:                   0.613
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               42.571
Skew:                           2.244   Prob(JB):                     5.70e-10
Kurtosis:                       8.563   Cond. No.                         4.00
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

d) Analyze the GDP per capita data set. How do countries from Europe, Asia, and
Africa compare in the rates of growth in GDP? When have countries faced substantial
changes in GDP, and what historical events were likely most responsible for it?
Solution:

2.1 Substantial Changes and Causes
When a country experiences substantial growth or decline in its GDP per capita, there can be
several reasons. The first and most likely one is technological change; in economics, technology
is defined as changes or innovations in production methods. That is why the Industrial Revolution
is still considered the most important development in human history and civilization. We can see
significant GDP per capita growth at the beginning of the 19th century, which is also the time
when humanity escaped the Malthusian trap.

The second biggest reason is war. The world's economies were significantly impacted by both world
wars. Due to devastation, loss of infrastructure, and the allocation of resources to the war
effort, the GDP of many nations fell during the war periods. However, as nations rebuilt and
industrialized, the post-war era frequently saw considerable economic recovery and growth.

Other factors with negative impacts on GDP growth include global financial crises, colonization,
and pandemics.

It can be said that after the Industrial Revolution, GDP per capita increased considerably in
Europe, Africa and Asia. But the difference between Europe and Africa in particular can be
explained by the early-industrializing European countries dominating global trade with new means
of production and by colonial activities. Apart from this, since GDP per capita is indexed to
population, the prosperity of European countries can partly be explained by their comparatively
low populations.

[5]: gpercap = pd.read_excel('/content/gapdata.xlsx', sheet_name=1, index_col=0)
gpercap.head()
gpercap.reset_index(inplace=True)

# I am creating new data sets by selecting countries per continent to make observations.
european_countries = ['Germany', 'France', 'United Kingdom', 'Italy', 'Spain']
asian_countries = ['China', 'India', 'Japan', 'South Korea', 'Indonesia']
african_countries = ['Nigeria', 'Egypt', 'South Africa', 'Algeria', 'Morocco']

# Filtering the data by continent
europe_df = gpercap[gpercap['Area'].isin(european_countries)]

# Grouping the data by country
grouped_europe_df = europe_df.groupby('Area')

# Visualizing the GDP per capita values for each European country
plt.figure(figsize=(10, 6))
for country, i in grouped_europe_df:
    plt.plot(i['Year'], i['GDP per capita - with interpolations'], label=country)

plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita for European Countries')
plt.legend()
plt.show()

# Asian countries
asia_df = gpercap[gpercap['Area'].isin(asian_countries)]
grouped_asia_df = asia_df.groupby('Area')

plt.figure(figsize=(10, 6))
for country, i in grouped_asia_df:
    plt.plot(i['Year'], i['GDP per capita - with interpolations'], label=country)

plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita for Asian Countries')
plt.legend()
plt.show()

# African countries
africa_df = gpercap[gpercap['Area'].isin(african_countries)]
grouped_africa_df = africa_df.groupby('Area')

plt.figure(figsize=(10, 6))
for country, i in grouped_africa_df:
    plt.plot(i['Year'], i['GDP per capita - with interpolations'], label=country)

plt.xlabel('Year')
plt.ylabel('GDP per Capita')
plt.title('GDP per Capita for African Countries')
plt.legend()
plt.show()

2.) For one data set of your own choosing, answer the following basic questions:
a) Who constructed it, when, and why?

Solution:
It was created by Mattias Lindgren on 21/04/2014 (version 14) to gather historical GDP per capita
data in one place. The data is well suited for research purposes because it reaches back to the
early 16th century for almost every country.
b) How big is it?
Solution:
It has 58112 rows and 16 columns: 'Area', 'Year', 'GDP per capita - with interpolations',
'Type of data', 'Data period', 'Longer description', 'Growth linked to which year', 'If use region or
neighbor - what is used', 'Source', 'Source notes', 'Other notes', 'Adjustments of data', 'Regional
average multiplied with spread-ot factor', 'Other footnotes', 'Other footnotes.1', 'Other footnotes.2'.

[6]: print(gpercap.shape, gpercap.columns)

(58112, 16) Index(['Area', 'Year', 'GDP per capita - with interpolations', 'Type
of data', 'Data period', 'Longer description', 'Growth linked to which year',
'If use region or neighbor - what is used', 'Source ', 'Source notes', 'Other
notes', 'Adjustments of data', 'Regional average multiplied with spread-ot
factor', 'Other footnotes', 'Other footnotes.1', 'Other footnotes.2'],
dtype='object')
c) Identify a few familiar or interpretable records.
Solution:
Substantial growth in the United States' GDP per capita during the World Wars.
d) Find out and describe what Tukey’s five number summary is and then provide one
for at least 3 different columns.
Solution:
Tukey's five-number summary is a way of summarizing a dataset's distribution using five order
statistics: the minimum, the lower quartile (Q1), the median, the upper quartile (Q3), and the
maximum. Together they give a brief description of the data's central tendency, spread, and shape.
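
A minimal sketch of the five numbers computed directly with NumPy quantiles (equivalent to the describe()-based version in the next cell), assuming a numeric pandas Series s:

[ ]: import numpy as np

def five_number_summary(s):
    """Tukey's five-number summary: min, Q1, median, Q3, max."""
    return {
        'min': np.min(s),
        'Q1': np.percentile(s, 25),
        'median': np.percentile(s, 50),
        'Q3': np.percentile(s, 75),
        'max': np.max(s),
    }

# Example usage on the GDP column (column name as in the dataset above):
# five_number_summary(gpercap['GDP per capita - with interpolations'].dropna())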

[7]: # I do not have 3 numeric columns in my dataset, so I am going to print the
# summary statistics for GDP per capita and Year.
columns = ['GDP per capita - with interpolations', 'Year']

summarystats = gpercap[columns].describe().loc[['min', '25%', '50%', '75%', 'max']]

print(summarystats)

     GDP per capita - with interpolations    Year
min                             281.908597  1270.0
25%                             699.661402  1831.0
50%                            1206.836602  1894.0
75%                            2538.269358  1957.0
max                          119849.293354  2018.0
e) State at least one interesting or noteworthy thing that you learned from your data
set.
Solution:
The most interesting thing is how similar these countries were in terms of wealth before the
18th century.

3 Part 2: Interpreting Visualizations


Points:       / 25
3.) Search your favorite news websites until you find 4 interesting charts/plots, ideally
half good and half bad. For each, please critique along the following dimensions:
a) Does it do a good job or a bad job of presenting the data? Why?
b) Does the presentation appear to be biased, either deliberately or accidentally?
c) Is there chartjunk in the figure? Where?
d) Are the axes labeled in a clear and informative way?
e) Is the color used effectively?
f) How can the graphic be improved?
Solution:
a-1) Does it do a good job or a bad job of presenting the data? Why?
(“https://www.nbcnews.com/data-graphics/tyre-nichols-protests-erupted-united-states-
rcna67987”) a) The graphic does a good job of presenting the data. It is clear and easy to
understand, and it provides a good overview of the number of protests that have occurred in the
United States since the release of bodycam footage of Tyre Nichols’ fatal beating.
b-1) Does the presentation appear to be biased, either deliberately or accidentally?
(“https://www.nbcnews.com/data-graphics/tyre-nichols-protests-erupted-united-states-rcna67987”) The presentation does not appear to be biased, either deliberately or accidentally.
The graphic simply shows the number of protests that have occurred, and it does not make any
claims about why the protests are happening or who is involved in them.
c-1) Is there chartjunk in the figure? Where? (“https://www.nbcnews.com/data-graphics/tyre-
nichols-protests-erupted-united-states-rcna67987”) There is no chartjunk in the figure. The graphic
is simple and uncluttered, and it only includes information that is relevant to the data that is being
presented.
d-1) Are the axes labeled in a clear and informative way? (“https://www.nbcnews.com/data-
graphics/tyre-nichols-protests-erupted-united-states-rcna67987”) The axes are labeled in a clear
and informative way. The x-axis shows the date of the protest, and the y-axis shows the number
of protests that occurred on that date.
e-1) Is the color used effectively? (“https://www.nbcnews.com/data-graphics/tyre-nichols-protests-
erupted-united-states-rcna67987”) The color is used effectively in the graphic. The different colors

are used to distinguish between the different cities where protests have occurred, and this makes it
easy to see which cities have had the most protests.
f-1) How can the graphic be improved? (“https://www.nbcnews.com/data-graphics/tyre-nichols-
protests-erupted-united-states-rcna67987”) The graphic could be improved by including a legend
that explains the different colors that are used. This would make it even easier to understand the
data that is being presented.
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on a city to see more information
about the protests that have occurred there. The graphic could be updated regularly to reflect the
latest data on the number of protests that have occurred. The graphic could be made into a poster
or flyer that could be distributed to raise awareness of the issue of police brutality.
a-2) The graphic does a good job of presenting the data. It is clear and easy to understand, and
it provides a good overview of the amount of smoke from Canadian wildfires that has covered
the United States in the past month. (https://www.nbcnews.com/data-graphics/canada-wildfire-
smoke-covered-us-month-rcna87998)
b-2) The presentation does not appear to be biased, either deliberately or accidentally. The
graphic simply shows the amount of smoke that has covered the United States, and it does
not make any claims about the cause of the wildfires or the impact of the smoke on air quality.
(https://www.nbcnews.com/data-graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
c-2) There is no chartjunk in the figure. The graphic is simple and uncluttered, and it only includes
information that is relevant to the data that is being presented. (https://www.nbcnews.com/data-
graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
d-2) The axes are labeled in a clear and informative way. The x-axis shows the date, and the
y-axis shows the amount of smoke in parts per million (ppm). (https://www.nbcnews.com/data-
graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
e-2) The color is used effectively in the graphic. The different colors are used to distinguish between
the different states that have been affected by smoke from Canadian wildfires, and this makes it easy
to see which states have been most affected. (https://www.nbcnews.com/data-graphics/canada-
wildfire-smoke-covered-us-month-rcna87998)
f-2) The graphic could be improved by including a legend that explains the different colors
that are used. This would make it even easier to understand the data that is being presented.
(https://www.nbcnews.com/data-graphics/canada-wildfire-smoke-covered-us-month-rcna87998)
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on a state to see more information
about the amount of smoke that has covered that state. The graphic could be updated regularly
to reflect the latest data on the amount of smoke from Canadian wildfires that is covering the
United States. The graphic could be made into a poster or flyer that could be distributed to raise
awareness of the issue of air quality and wildfires.
a-3) Does it do a good job or a bad job of presenting the data? Why?
(https://www.nbcnews.com/data-graphics/absentee-homeowners-crowding-housing-market-
data-rcna69828) I think the graphic does a good job of presenting the data. It is clear and easy to
understand, and it provides a good overview of the share of homes sold to absentee owners in nine

major metropolitan areas in the United States. The graphic shows that the share of homes sold to
absentee owners has increased since 2020 in all nine areas.
b-3) Does the presentation appear to be biased, either deliberately or accidentally?
(https://www.nbcnews.com/data-graphics/absentee-homeowners-crowding-housing-market-data-
rcna69828)
I do not think the presentation appears to be biased, either deliberately or accidentally. The
graphic simply shows the data, and it does not make any claims about why the share of homes sold
to absentee owners has increased.
c-3) Is there chartjunk in the figure? Where? (https://www.nbcnews.com/data-graphics/absentee-
homeowners-crowding-housing-market-data-rcna69828)
I do not see any chartjunk in the figure. The graphic is simple and uncluttered, and it only includes
information that is relevant to the data that is being presented.
d-3) Are the axes labeled in a clear and informative way? (https://www.nbcnews.com/data-
graphics/absentee-homeowners-crowding-housing-market-data-rcna69828)
Yes, the axes are labeled in a clear and informative way. The x-axis shows the year, and the y-axis
shows the percentage of homes sold to absentee owners.
e-3) Is the color used effectively? (https://www.nbcnews.com/data-graphics/absentee-homeowners-
crowding-housing-market-data-rcna69828)
Yes, the color is used effectively in the graphic. The different colors are used to distinguish between
the different metropolitan areas, and this makes it easy to see which areas have had the highest
and lowest shares of homes sold to absentee owners.
f-3) How can the graphic be improved? (https://www.nbcnews.com/data-graphics/absentee-
homeowners-crowding-housing-market-data-rcna69828)
The graphic could be improved by including a legend that explains the different colors that are
used. This would make it even easier to understand the data that is being presented.
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on a metropolitan area to see more
information about the share of homes sold to absentee owners in that area. The graphic could be
updated regularly to reflect the latest data on the share of homes sold to absentee owners in the
nine metropolitan areas. The graphic could be made into a poster or flyer that could be distributed
to raise awareness of the issue of absentee ownership in the housing market.
a-4) Does it do a good job or a bad job of presenting the data? Why?
(https://www.nbcnews.com/data-graphics/turkey-earthquake-map-see-aftershocks-rcna69713)
I think the graphic does a good job of presenting the data. It is clear and easy to understand,
and it provides a good overview of the location and magnitude of the earthquakes and aftershocks
that have occurred in Turkey and Syria since February 3, 2023. The graphic shows that the
earthquakes have been concentrated in a region along the border between Turkey and Syria, and
that the aftershocks have been gradually decreasing in magnitude.
b-4) Does the presentation appear to be biased, either deliberately or accidentally?
(https://www.nbcnews.com/data-graphics/turkey-earthquake-map-see-aftershocks-rcna69713)

I do not think the presentation appears to be biased, either deliberately or accidentally. The graphic
simply shows the data, and it does not make any claims about the cause of the earthquakes or the
impact of the earthquakes and aftershocks on the people of Turkey and Syria.
c-4) Is there chartjunk in the figure? Where? (https://www.nbcnews.com/data-graphics/turkey-
earthquake-map-see-aftershocks-rcna69713)
I do not see any chartjunk in the figure. The graphic is simple and uncluttered, and it only includes
information that is relevant to the data that is being presented.
d-4) Are the axes labeled in a clear and informative way? (https://www.nbcnews.com/data-
graphics/turkey-earthquake-map-see-aftershocks-rcna69713)
Yes, the axes are labeled in a clear and informative way. The x-axis shows the date of the earthquake
or aftershock, and the y-axis shows the magnitude of the earthquake or aftershock.
e-4) Is the color used effectively? (https://www.nbcnews.com/data-graphics/turkey-earthquake-
map-see-aftershocks-rcna69713)
Yes, the color is used effectively in the graphic. The different colors are used to distinguish between
the different earthquakes and aftershocks, and this makes it easy to see which earthquakes and
aftershocks have occurred.
f-4) How can the graphic be improved? (https://www.nbcnews.com/data-graphics/turkey-
earthquake-map-see-aftershocks-rcna69713)
The graphic could be improved by including a legend that explains the different colors that are
used. This would make it even easier to understand the data that is being presented.
Here are some other suggestions for how the graphic could be improved:
The graphic could be made interactive, so that users could click on an earthquake or aftershock to
see more information about it. The graphic could be updated regularly to reflect the latest data
on the earthquakes and aftershocks that have occurred in Turkey and Syria. The graphic could be
made into a poster or flyer that could be distributed to raise awareness of the risk of earthquakes
and aftershocks in Turkey and Syria.
4.) Visit https://viz.wtf and find five laughably bad visualizations. Explain why they
are both bad and amusing.
Solution:
1-)The “Bar Chart Race” That Doesn’t Race (https://bigthink.com/strange-maps/bar-chart-
races/)
This visualization is supposed to be a bar chart race, which is a type of animation that shows how
data changes over time. However, this visualization doesn’t actually race. The bars just sit there,
and the data changes in a very slow and unexciting way.
2-)The Pie Chart That Doesn’t Add Up (https://www.addtwodigital.com/add-two-
blog/2021/2/14/rule-4-the-values-in-your-pie-chart-should-add-up-to-100)
This pie chart is supposed to show the distribution of a population by age group. However, the
percentages don’t add up to 100%. This is a basic mistake that any data visualization should avoid.
3-)The Line Chart That Goes Off the Rails (https://www.fusioncharts.com/line-charts)

This line chart is supposed to show the stock market over time. However, the line goes off the rails
at one point. This is a sign that the data is unreliable or that the visualization is not properly
designed.
4-)The Scatterplot That’s Not Scattered (https://chartio.com/learn/charts/what-is-a-scatter-
plot/)
This scatterplot is supposed to show the relationship between two variables. However, the points
are not scattered. This is a sign that the data is not normally distributed or that the visualization
is not properly designed.
5-)The Map That’s Not to Scale (https://www.independent.co.uk/news/science/world-
map-mercator-peters-gall-projection-boston-globe-us-schools-european-colonial-distortion-bias-
a7639101.html)
This map is supposed to show the size of different countries. However, the countries are not to
scale. This is a sign that the map is not properly designed or that the data is not reliable.
These are just a few examples of laughably bad visualizations. There are many more out there,
and they can be quite amusing. However, they are also a reminder that data visualization is a skill
that takes time and practice to master.

4 Part 3: Creating Visualizations & Storytelling


Points:       / 25
5.) Construct a revealing visualization of some aspect of your favorite data set, using:
a) A well-designed table.
Solution:

[8]: olympics_df = pd.read_csv('/content/olympics.csv', index_col=0)

olympics_df.reset_index(inplace=True)
olympics_df.head()

# I am using the style method in pandas.
styled_df = (
    olympics_df.style
    .background_gradient(cmap='Blues')  # gradient background color
    .set_properties(subset=['Country name', '2010 population'], **{'font-weight': 'bold'})  # bold
    .highlight_max(subset='Gold medals', color='red')     # maximum value in Gold medals in red
    .highlight_max(subset='Silver medals', color='gray')  # maximum value in Silver medals in gray
    .highlight_max(subset='Bronze medals', color='pink')  # maximum value in Bronze medals in pink
)

# Display the styled table
styled_df

[8]: <pandas.io.formats.style.Styler at 0x7f47405ee5c0>

b) A dot and/or line plot.


Solution:

[9]: import matplotlib.pyplot as plt

olympics_df['Number of Medals'] = olympics_df['Gold medals'] + olympics_df['Silver medals'] + olympics_df['Bronze medals']

countries = olympics_df['Country name']
medals = olympics_df['Number of Medals']

# Line plot
plt.figure(figsize=(10, 6))
plt.plot(countries, medals, marker='o', linestyle='-', color='blue')
plt.xticks(rotation=90)
plt.xlabel('Country')
plt.ylabel('Number of Medals')
plt.title('2012 London Olympics - Number of Medals by Country')
plt.tight_layout()
plt.show()

c) A scatter plot.

Solution:

[10]: countries = olympics_df['Country name']
medals = olympics_df['Number of Medals']

# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(countries, medals, s=50, c='blue', alpha=0.7)
plt.xticks(rotation=90)
plt.xlabel('Country')
plt.ylabel('Number of Medals')
plt.title('2012 London Olympics - Number of Medals by Country')
plt.grid(True)
plt.tight_layout()

plt.show()

d) A heatmap.
Solution:

[11]: import matplotlib.pyplot as plt
import seaborn as sns
columns = ['2011 GDP', '2010 population', 'Number of Medals']
heatmap_data = olympics_df[columns]

#Corr Matrix
correlation_matrix = heatmap_data.corr()

# Labels
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')

plt.show()

e) A bar plot and/or a pie chart.


Solution:

[12]: medalnumbers = olympics_df.groupby('Country name')['Number of Medals'].sum()

# Pie chart using matplotlib
medalnumbers.plot(kind='pie', autopct='%1.1f%%')
plt.title('Medal Distribution by Country 2012 London Olympics')
plt.axis('equal') # to make pie circular

plt.show()

f) A histogram.
Solution:

[13]: olympics_df['2011 GDP'] = olympics_df['2011 GDP'].astype('int')

plt.hist(olympics_df['2011 GDP'], bins=10, edgecolor='blue')

# Plot title and labels
plt.title('Distribution of GDP of the Countries')
plt.xlabel('GDP')
plt.ylabel('Frequency')

plt.show()

6.) Find and tell a Story with data! To this end, first go to
https://uol.de/planung-entwicklung/akademisches-controlling/studium-und-lehre
and select one of the following data sets:
• Studienanfängerinnen / Studienanfängerinnen (Fallstatistik) nach Studiengang
• Fachstudiendauer / Übersicht über die Fachstudiendauer
Then explore the data and find a story to tell with it.
a) Define an audience and a goal.
(Example: A data viz that highlights a potential issue to the university council or one
that tries to win new students for a certain subject area.)
Solution:

4.1 One can choose a major by looking at this data: we have the total number of graduates
grouped by year and department. The average study duration is also a good indicator of
whether a department is easy to finish within the standard time or not.
[14]: df_ol = pd.read_excel('/content/Fachstudiendauer_2021_20220802_EV.xlsx', index_col=0)

df_ol = df_ol[1:]
df_ol.dropna(inplace=True)
df_ol.head()

[14]: df_ol.head(): first five rows (all Fakultät I, Lehreinheit Pädagogik).
Columns: Studienfach, Abschluss, Regelstudienzeit (RSZ), values for 2012-2021, Gesamt, Anzahl Abschlüsse.

Bildungs/Wissenschaftsman, Fach-Master, RSZ 5.0:
  7.800000  6.642857  7.250000  8.090909  8.875000  7.777778  10.375000  8.111111  8.714286  9.428571 | Gesamt 8.309278 | 97.0
Erzieh.-Bildungswissensch, Fach-Master, RSZ 4.0:
  4.212766  4.721311  5.234375  5.647059  5.612903  6.256410   6.388060  6.550000  6.300000  7.179487 | Gesamt 5.720000 | 500.0
Interk.-Bildung/Beratung, Fach-Bachelor, RSZ 6.0:
  7.400000  7.923077  8.666667  11.666667  6.142857  8.000000  15.500000  10.500000  18.500000  8.000000 | Gesamt 9.040000 | 50.0
Pädagogik, Fach-Bachelor, RSZ 6.0:
  6.385965  6.366197  6.864198  6.692308  7.050847  6.946429   7.138462  7.708861  8.096154  7.753623 | Gesamt 7.096330 | 654.0
Pädagogik, Zwei-Fächer-Bachelor, RSZ 6.0:
  6.775000  7.260870  7.163636  6.974359  7.404255  8.000000   7.520000  7.848485  7.711111  8.451613 | Gesamt 7.468085 | 423.0

b) Create a (communicative) data visualization to help your cause.


Solution:

[15]: df_ol['Anzahl Abschlüsse'] = pd.to_numeric(df_ol['Anzahl Abschlüsse'], errors='coerce')

graduates = df_ol.groupby('Lehreinheit')['Anzahl Abschlüsse'].sum()

# Pie chart using matplotlib
graduates.plot(kind='pie', autopct='%1.1f%%')
plt.title('Number of Graduates Distribution by Lehreinheit')
plt.axis('equal')  # to make the pie circular

plt.show()

5 Part 4: Machine Learning Intro
Points:       / 25
7.) Give decision trees to represent the following Boolean functions:
a) A and B.
Solution:
A
/ \
/ \
/ \
B 0
/ \
/ \
1 0

In this decision tree, the root node tests variable A. The left branch corresponds to A = True (1)
and the right branch to A = False (0). If A is False, the tree returns 0 immediately. If A is True,
the next node tests B: its left leaf returns 1 (B = True) and its right leaf returns 0 (B = False).
Note that this decision tree assumes that A and B are binary variables (taking values of 0 or 1).
b) A or (B and C).

Solution:
A
/ \
/ \
/ \
1 B
/ \
/ \
/ \
C 0
/ \
/ \
1 0
In this decision tree, the root node tests A. If A is True (left branch), the tree returns 1
immediately, because A or (B and C) is already satisfied and no further evaluation is needed.
If A is False (right branch), the next node tests B: if B is False, the tree returns 0; if B is
True, the final node tests C, returning 1 when C is True and 0 when C is False. Note that this
decision tree assumes that A, B, and C are binary variables (taking values of 0 or 1).
c) (A and B) or (C and D)
Solution:
The subtree for (C and D) appears twice, so define it once:

      C
     / \
    D   0
   / \
  1   0

Full tree (T denotes the (C and D) subtree above):

        A
       / \
      B   T
     / \
    1   T

In this decision tree, the root node tests A. If A is True, the next node tests B: when B is also
True, the tree returns 1, because (A and B) is satisfied; when B is False, the outcome still
depends on (C and D), so the subtree T is evaluated. If A is False, the tree goes directly to the
subtree T. Inside T, C is tested first: if C is False the result is 0; if C is True, D decides the
result (1 for True, 0 for False). Note that this decision tree assumes that A, B, C, and D are
binary variables (taking values of 0 or 1).
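
As a sanity check, the trees can be compared against their Boolean formulas by enumerating every input combination. A minimal sketch, with each tree written as nested conditionals that mirror the drawings above (1 = True, 0 = False):

[ ]: from itertools import product

def tree_a(a, b):           # A and B
    return (1 if b else 0) if a else 0

def tree_b(a, b, c):        # A or (B and C)
    return 1 if a else ((1 if c else 0) if b else 0)

def tree_c(a, b, c, d):     # (A and B) or (C and D)
    cd = (1 if d else 0) if c else 0   # the shared (C and D) subtree T
    if a:
        return 1 if b else cd
    return cd

# Verify each tree against its formula over the full truth table.
assert all(tree_a(a, b) == int(a and b) for a, b in product([0, 1], repeat=2))
assert all(tree_b(a, b, c) == int(a or (b and c)) for a, b, c in product([0, 1], repeat=3))
assert all(tree_c(a, b, c, d) == int((a and b) or (c and d))
           for a, b, c, d in product([0, 1], repeat=4))
print("All three trees match their Boolean functions.")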
8.) Consider the following titanic dataset: https://www.kaggle.com/competitions/titanic/data
a) Load the test and training data sets. Briefly describe the dataset.

Solution:
PassengerId (integer): a serial number that gives each passenger a unique id.
Survived (boolean): whether a passenger survived or not. 0 = not survived, 1 = survived.
Pclass (integer): the ticket class the passenger belongs to. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Name (string): the name of the passenger.
Sex (string): the gender of the passenger, male or female.
Age (numeric): the age of the passenger.
SibSp (integer): the number of siblings or spouses aboard the ship.
Parch (integer): the number of parents or children aboard the ship.
Ticket (string): the ticket number of the passenger.
Fare (numeric): the fare paid for the passenger's ticket.
Cabin (string): the passenger's cabin number on the ship.
Embarked (string): the port of embarkation. C = Cherbourg, Q = Queenstown, S = Southampton.
From the column descriptions we can classify the data into categorical and numerical variables:
Categorical: Survived, Sex, Embarked, Pclass (ordinal)
Numerical: Age (continuous), Fare (continuous), SibSp (discrete), Parch (discrete)

[16]: pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.width', 170)

t_train = pd.read_csv('/content/train.csv', index_col=0)
t_test = pd.read_csv('/content/test.csv', index_col=0)

# Concatenate the datasets
t_train = pd.concat([t_train, t_test], axis=0, ignore_index=True)
t_train.reset_index(inplace=True)
t_train.head()

# I am using my pre-defined functions to analyze the data
# (https://www.kaggle.com/code/burakuzn)
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(t_train)

##################### Shape #####################


(1309, 12)
##################### Types #####################
index int64
Survived float64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
##################### Head #####################
index Survived Pclass Name
Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 0.000 3 Braund, Mr. Owen Harris
male 22.000 1 0 A/5 21171 7.250 NaN S
1 1 1.000 1 Cumings, Mrs. John Bradley (Florence Briggs Th…
female 38.000 1 0 PC 17599 71.283 C85 C
2 2 1.000 3 Heikkinen, Miss. Laina
female 26.000 0 0 STON/O2. 3101282 7.925 NaN S
3 3 1.000 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
female 35.000 1 0 113803 53.100 C123 S
4 4 0.000 3 Allen, Mr. William Henry
male 35.000 0 0 373450 8.050 NaN S
##################### Tail #####################
index Survived Pclass Name Sex Age
SibSp Parch Ticket Fare Cabin Embarked
1304 1304 NaN 3 Spector, Mr. Woolf male NaN
0 0 A.5. 3236 8.050 NaN S
1305 1305 NaN 1 Oliva y Ocana, Dona. Fermina female 39.000
0 0 PC 17758 108.900 C105 C
1306 1306 NaN 3 Saether, Mr. Simon Sivertsen male 38.500
0 0 SOTON/O.Q. 3101262 7.250 NaN S

1307 1307 NaN 3 Ware, Mr. Frederick male NaN
0 0 359309 8.050 NaN S
1308 1308 NaN 3 Peter, Master. Michael J male NaN
1 1 2668 22.358 NaN C
##################### NA #####################
index 0
Survived 418
Pclass 0
Name 0
Sex 0
Age 263
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 1014
Embarked 2
dtype: int64
##################### Quantiles #####################
0.000 0.050 0.500 0.950 0.990 1.000
index 0.000 65.400 654.000 1242.600 1294.920 1308.000
Survived 0.000 0.000 0.000 1.000 1.000 1.000
Pclass 1.000 1.000 3.000 3.000 3.000 3.000
Age 0.170 5.000 28.000 57.000 65.000 80.000
SibSp 0.000 0.000 0.000 2.000 5.000 8.000
Parch 0.000 0.000 0.000 2.000 4.000 9.000
Fare 0.000 7.225 14.454 133.650 262.375 512.329
<ipython-input-16-87edc5a594fb>:25: FutureWarning: The default value of
numeric_only in DataFrame.quantile is deprecated. In a future version, it will
default to False. Select only valid columns or specify the value of numeric_only
to silence this warning.
print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

[ ]: # Numerical and categorical variable analysis
def grab_col_names(dataframe, cat_th=10, car_th=20):
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns
                   if dataframe[col].nunique() < cat_th and dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns
                   if dataframe[col].nunique() > car_th and dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')
    return cat_cols, num_cols, cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(t_train)

Observations: 1309
Variables: 12
cat_cols: 6
num_cols: 3
cat_but_car: 3
num_but_cat: 4

[ ]: # Checking for outliers before preparing the data for modeling.
# quantile 1 = 0.05, quantile 3 = 0.99
def outlier_thresholds(dataframe, col_name, q1=0.05, q3=0.99):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False

for col in num_cols:
    print(col, check_outlier(t_train, col))

# We don't have any outliers according to our specified quantiles.

index False
Age False
Fare False

5.1 After concatenating the train and test sets, the Titanic data has 1309 rows and 12
columns in total. There are 263 NA values in 'Age', 1014 in 'Cabin', 2 in 'Embarked', and
1 in 'Fare' (the 418 missing 'Survived' values come from the test split). This dataset is
widely used in Kaggle classification competitions. Our main aim is to predict who
survives and who does not.
[ ]:

b) Train a random forest classifier to predict survival chances for Titanic passengers.
(Hint: You can use one of the tutorials/submissions as a starting point.)
Solution:

[ ]: # Modeling

# First we need to find the categorical variables that have between 2 and 10 unique
# classes, so we can create dummy variables and incorporate them in the model.
# I am using one-hot encoding to get the dummies.
def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

ohe_cols = [col for col in t_train.columns if 10 >= t_train[col].nunique() >= 2]
df = one_hot_encoder(t_train, ohe_cols, drop_first=True)
df.head()

# Now we have created dummies for our categorical variables.

# We also need to get rid of some variables that are not suitable for modeling:
# Name, Ticket, index (PassengerId) and Cabin (because it includes too many NA values).
df.drop(['index', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# Dropping the NA values (I did not impute any NA values).
df.dropna(inplace=True)
df.head()

[ ]: index Age Fare Survived_1.0 Pclass_2 Pclass_3 Sex_male SibSp_1


SibSp_2 SibSp_3 SibSp_4 SibSp_5 SibSp_8 Parch_1 Parch_2 Parch_3 Parch_4
Parch_5 \
0 0 22.000 7.250 0 0 1 1 1
0 0 0 0 0 0 0 0 0
0
1 1 38.000 71.283 1 0 0 0 1
0 0 0 0 0 0 0 0 0
0
2 2 26.000 7.925 1 0 1 0 0

0 0 0 0 0 0 0 0 0
0
3 3 35.000 53.100 1 0 0 0 1
0 0 0 0 0 0 0 0 0
0
4 4 35.000 8.050 0 0 1 1 0
0 0 0 0 0 0 0 0 0
0

Parch_6 Parch_9 Embarked_Q Embarked_S


0 0 0 0 1
1 0 0 0 0
2 0 0 0 1
3 0 0 0 1
4 0 0 0 1

[ ]: # Scaling the numerical values
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, RobustScaler, OneHotEncoder

rs = RobustScaler()
nums = df[['Age', 'Fare']]
# Note: the scaled array below is only displayed; it is not assigned back to df.
rs.fit_transform(nums)

[ ]: array([[-0.33333333, -0.30965392],
[ 0.55555556, 2.02307104],
[-0.11111111, -0.28506375],
…,
[ 0. , -0.29052823],
[ 0.61111111, 3.39344262],
[ 0.58333333, -0.30965392]])
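
Note that the scaled array above is only displayed and never assigned back, so the models below are trained on the unscaled Age and Fare values (tree ensembles do not mind, but scale-sensitive models such as logistic regression and SVM do). A minimal sketch of how the scaling could actually be applied, reusing the same RobustScaler and columns:

[ ]: # Hypothetical follow-up: write the robust-scaled values back into df so that
# downstream estimators see scaled 'Age' and 'Fare' columns.
df[['Age', 'Fare']] = rs.fit_transform(df[['Age', 'Fare']])
df[['Age', 'Fare']].describe()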

[ ]: # The model
# Splitting data into train and test
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Target and features: after get_dummies with drop_first=True, the label column
# is the dummy 'Survived_1.0' (see df.head() above).
y = df['Survived_1.0']
X = df.drop('Survived_1.0', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# We have 74.82% accuracy, which is not bad without any feature engineering or
# hyperparameter optimization.

Accuracy: 0.7482517482517482

[ ]:

c) Evaluate the performance of your model and iterate on it to improve it!


Solution:

[ ]: #!pip install catboost
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

models = [('Logistic Regression', LogisticRegression()),
          ('K-Nearest Neighbors', KNeighborsClassifier()),
          ('Decision Tree', DecisionTreeClassifier()),
          ('Random Forest', RandomForestClassifier()),
          ('Support Vector Machine', SVC()),
          ('Gradient Boosting', GradientBoostingClassifier()),
          ('XGBoost', XGBClassifier()),
          ('LightGBM', LGBMClassifier()),
          ('CatBoost', CatBoostClassifier(verbose=False))
          ]

# Using 10-fold cross-validation for each model
for name, classifier in models:
    accuracy = np.mean(cross_val_score(classifier, X, y, cv=10, scoring='accuracy'))
    print(f"Accuracy: {round(accuracy, 4)} ({name})")

"""
Accuracy: 0.7955 (Logistic Regression)
Accuracy: 0.6892 (K-Nearest Neighbors)
Accuracy: 0.7913 (Decision Tree)
Accuracy: 0.7984 (Random Forest)
Accuracy: 0.6654 (Support Vector Machine)
Accuracy: 0.8193 (Gradient Boosting)
Accuracy: 0.7998 (XGBoost)
Accuracy: 0.8194 (LightGBM)
Accuracy: 0.8292 (CatBoost)
"""

# These are our accuracy values for each model without any tuning; the best one
# is CatBoost, which is not surprising given that it is built to handle
# categorical variables.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: catboost in /usr/local/lib/python3.10/dist-
packages (1.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.10/dist-
packages (from catboost) (0.20.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-
packages (from catboost) (3.7.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.10/dist-
packages (from catboost) (1.22.4)
Requirement already satisfied: pandas>=0.24 in /usr/local/lib/python3.10/dist-
packages (from catboost) (1.5.3)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages
(from catboost) (1.10.1)
Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages
(from catboost) (5.13.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages
(from catboost) (1.16.0)
Requirement already satisfied: python-dateutil>=2.8.1 in
/usr/local/lib/python3.10/dist-packages (from pandas>=0.24->catboost) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-
packages (from pandas>=0.24->catboost) (2022.7.1)
Requirement already satisfied: contourpy>=1.0.1 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (1.0.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-
packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (4.39.3)
Requirement already satisfied: kiwisolver>=1.0.1 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: packaging>=20.0 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-
packages (from matplotlib->catboost) (8.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in
/usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: tenacity>=6.2.0 in
/usr/local/lib/python3.10/dist-packages (from plotly->catboost) (8.2.2)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:

https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-
regression
n_iter_i = _check_optimize_result(

38
Accuracy: 0.7955 (Logistic Regression)
Accuracy: 0.6892 (K-Nearest Neighbors)
Accuracy: 0.7913 (Decision Tree)
Accuracy: 0.7984 (Random Forest)
Accuracy: 0.6654 (Support Vector Machine)
Accuracy: 0.8193 (Gradient Boosting)
Accuracy: 0.7998 (XGBoost)
Accuracy: 0.8194 (LightGBM)
Accuracy: 0.8292 (CatBoost)
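
The ConvergenceWarning above is raised by the LogisticRegression model inside the cross-validation loop. A minimal sketch of the two remedies the warning itself suggests (scaling the features and allowing more iterations), assuming the feature matrix X and target y prepared earlier in the notebook:

# Sketch only: scale the features in a Pipeline and raise max_iter so that
# lbfgs can converge. Assumes X and y are the arrays used in the cells above.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
accuracy = np.mean(cross_val_score(log_reg, X, y, cv=10, scoring="accuracy"))
print(f"Accuracy: {round(accuracy, 4)} (Logistic Regression, scaled)")

Scaling inside the pipeline keeps the cross-validation honest, because the scaler is refitted on each training fold rather than on the full data set.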

[ ]: # Model Tuning
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# hyperparameter grids
rf_params = {"max_depth": [5, 8, 15, None],
             "max_features": [5, 7, "auto"],
             "min_samples_split": [8, 15, 20],
             "n_estimators": [200, 500, 1000]}

gbm_params = {"learning_rate": [0.01, 0.1],
              "max_depth": [3, 8],
              "n_estimators": [500, 1000],
              "subsample": [1, 0.5, 0.7]}

catboost_params = {"iterations": [200, 500],
                   "learning_rate": [0.01, 0.1],
                   "depth": [3, 6]}

# classifiers and their search spaces
classifiers = [("RF", RandomForestClassifier(), rf_params),
               ("GBM", GradientBoostingClassifier(), gbm_params),
               ("CatBoost", CatBoostClassifier(verbose=False), catboost_params)]

# Iterate over the classifiers
for name, classifier, param_grid in classifiers:
    # Perform randomized search with 10-fold cross-validation
    random_search = RandomizedSearchCV(classifier, param_grid, cv=10,
                                       scoring="accuracy", n_iter=10, n_jobs=-1)
    random_search.fit(X, y)

    # Print the best parameters and the mean cross-validated score
    print(f"Best parameters for {name}: {random_search.best_params_}")
    print(f"Mean cross-validated score for {name}: {random_search.best_score_}")
    print()

Best parameters for RF: {'n_estimators': 200, 'min_samples_split': 8,
'max_features': 7, 'max_depth': 8}
Mean cross-validated score for RF: 0.8180946791862285

Best parameters for GBM: {'subsample': 0.5, 'n_estimators': 500, 'max_depth': 3,
'learning_rate': 0.01}
Mean cross-validated score for GBM: 0.8151799687010955

/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:305:
UserWarning: The total space of parameters 8 is smaller than n_iter=10. Running
8 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Best parameters for CatBoost: {'learning_rate': 0.1, 'iterations': 200, 'depth': 6}
Mean cross-validated score for CatBoost: 0.831964006259781
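
The UserWarning above is expected: the CatBoost grid only spans 2 · 2 · 2 = 8 combinations, so a randomized search with n_iter=10 cannot draw ten distinct candidates. A hedged sketch of the exhaustive alternative the warning recommends, reusing catboost_params, X, and y from the cell above:

# Sketch only: exhaustive search over the small CatBoost grid, as suggested by
# the UserWarning. Assumes catboost_params, X, and y are defined as above.
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(CatBoostClassifier(verbose=False), catboost_params,
                           cv=10, scoring="accuracy", n_jobs=-1)
grid_search.fit(X, y)
print(f"Best parameters for CatBoost: {grid_search.best_params_}")
print(f"Mean cross-validated score for CatBoost: {grid_search.best_score_}")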

[ ]: import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# hyperparameter grids
rf_params = {"max_depth": [5, 8, 15, None],
             "max_features": [5, 7, "auto"],
             "min_samples_split": [8, 15, 20],
             "n_estimators": [200, 500, 1000]}

gbm_params = {"learning_rate": [0.01, 0.1],
              "max_depth": [3, 8],
              "n_estimators": [500, 1000],
              "subsample": [1, 0.5, 0.7]}

catboost_params = {"iterations": [200, 500],
                   "learning_rate": [0.01, 0.1],
                   "depth": [3, 6]}

# classifiers and their search spaces
classifiers = [("RF", RandomForestClassifier(), rf_params),
               ("GBM", GradientBoostingClassifier(), gbm_params),
               ("CatBoost", CatBoostClassifier(verbose=False), catboost_params)]

# Iterate over the classifiers
best_models = {}
for name, classifier, param_grid in classifiers:
    print(f"########## {name} ##########")

    # Evaluation using default hyperparameters
    accuracy = np.mean(cross_val_score(classifier, X, y, cv=10, scoring="accuracy"))
    print(f"Accuracy: {round(accuracy, 4)} ({name}) ")

    # Perform randomized search to find the best hyperparameters
    random_search = RandomizedSearchCV(classifier, param_grid, cv=3, n_iter=10,
                                       scoring="accuracy", n_jobs=-1, verbose=False)
    random_search.fit(X, y)

    # Set the best hyperparameters on the classifier
    final_model = classifier.set_params(**random_search.best_params_)

    # Evaluation with the best hyperparameters
    accuracy = np.mean(cross_val_score(final_model, X, y, cv=10, scoring="accuracy"))
    print(f"Accuracy (After): {round(accuracy, 4)} ({name}) ")

    # Save the tuned model in the dictionary
    best_models[name] = final_model

    print(f"{name} best params: {random_search.best_params_}", end="\n\n")

best_models

# After model tuning, our RF classifier reaches an accuracy of 82.79%.

########## RF ##########
Accuracy: 0.8027 (RF)
Accuracy (After): 0.8279 (RF)
RF best params: {'n_estimators': 200, 'min_samples_split': 8, 'max_features': 7,
'max_depth': 15}

########## GBM ##########
Accuracy: 0.8165 (GBM)
Accuracy (After): 0.8307 (GBM)
GBM best params: {'subsample': 0.5, 'n_estimators': 1000, 'max_depth': 3,
'learning_rate': 0.01}

########## CatBoost ##########
Accuracy: 0.8292 (CatBoost)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:305:
UserWarning: The total space of parameters 8 is smaller than n_iter=10. Running
8 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Accuracy (After): 0.8306 (CatBoost)
CatBoost best params: {'learning_rate': 0.1, 'iterations': 200, 'depth': 3}

[ ]: {'RF': RandomForestClassifier(max_depth=15, max_features=7, min_samples_split=8,
n_estimators=200),
'GBM': GradientBoostingClassifier(learning_rate=0.01, n_estimators=1000,
subsample=0.5),
'CatBoost': <catboost.core.CatBoostClassifier at 0x7ff1a51125c0>}
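
Note that both RandomizedSearchCV and cross_val_score fit clones internally, so the estimators stored in best_models carry the tuned hyperparameters but are not yet fitted themselves. A minimal sketch of refitting one of them on the full data and persisting it with joblib; the model choice and the file name catboost_model.pkl are only illustrative:

# Sketch only: refit a tuned model on all data and save it for later reuse.
# Assumes best_models, X, and y from the cells above; the file name is arbitrary.
import joblib

final_catboost = best_models["CatBoost"].fit(X, y)
joblib.dump(final_catboost, "catboost_model.pkl")

# The saved model can later be reloaded and used for predictions:
loaded_model = joblib.load("catboost_model.pkl")
predictions = loaded_model.predict(X)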

[ ]: # Ensemble Learning (Voting Classifier)
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=[("RF", best_models["RF"]),
                                          ("GBM", best_models["GBM"]),
                                          ("CatBoost", best_models["CatBoost"])],
                              voting="hard")
voting_clf.fit(X, y)

accuracy = np.mean(cross_val_score(voting_clf, X, y, cv=10, scoring="accuracy"))
print(f"Accuracy: {round(accuracy, 4)}")
# Our voting classifier reaches about 83.06% cross-validated accuracy.

Accuracy: 0.8306
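
With voting='hard' the ensemble simply counts the class votes of its members. Since all three tuned base models also implement predict_proba, a soft-voting variant that averages the predicted class probabilities is a natural alternative to try; a minimal sketch, reusing best_models, X, and y from above (whether it beats hard voting here is not guaranteed):

# Sketch only: soft voting averages the predicted class probabilities of the
# base models instead of counting their hard votes. Assumes best_models, X, y.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

soft_voting_clf = VotingClassifier(estimators=[("RF", best_models["RF"]),
                                               ("GBM", best_models["GBM"]),
                                               ("CatBoost", best_models["CatBoost"])],
                                   voting="soft")

accuracy = np.mean(cross_val_score(soft_voting_clf, X, y, cv=10, scoring="accuracy"))
print(f"Accuracy (soft voting): {round(accuracy, 4)}")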

6 Finally: Submission
Save your notebook and submit it (as both notebook and PDF file). And please don’t forget to

- … choose a file name according to convention (see Exercise Sheet 1, but please add your group
name as a suffix like _group01) and to
- … include the execution output in your submission!
