Extended Case Code Sample

extended_case_2_fellow
January 13, 2021
1 The adverse health effects of air pollution - are we making any progress?
Credit: Flickr/E4C
In [1]: # Load relevant packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import warnings
import fiona
import geopandas as gpd
import descartes as dsc
import math
import matplotlib.axis as ax
from mpl_toolkits import mplot3d
%matplotlib inline
from pandas.plotting import andrews_curves
warnings.filterwarnings("ignore") # Suppress all warnings
In [2]: %env PROJ_DIR=c:\
env: PROJ_DIR=c:\
In [3]: world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

world.plot();
In [102]: fig, ax = plt.subplots(1, 1)
world
Out[102]: pop_est continent name iso_a3 \

0 28400000.0 Asia Afghanistan AFG
1 12799293.0 Africa Angola AGO
2 3639453.0 Europe Albania ALB
3 4798491.0 Asia United Arab Emirates ARE
4 40913584.0 South America Argentina ARG
5 2967004.0 Asia Armenia ARM
6 3802.0 Antarctica Antarctica ATA
7 140.0 Seven seas (open ocean) Fr. S. Antarctic Lands ATF
8 21262641.0 Oceania Australia AUS
9 8210281.0 Europe Austria AUT
10 8238672.0 Asia Azerbaijan AZE
11 8988091.0 Africa Burundi BDI
12 10414336.0 Europe Belgium BEL
13 8791832.0 Africa Benin BEN
14 15746232.0 Africa Burkina Faso BFA
15 156050883.0 Asia Bangladesh BGD
16 7204687.0 Europe Bulgaria BGR
17 309156.0 North America Bahamas BHS
18 4613414.0 Europe Bosnia and Herz. BIH
19 9648533.0 Europe Belarus BLR
20 307899.0 North America Belize BLZ
21 9775246.0 South America Bolivia BOL
22 198739269.0 South America Brazil BRA
23 388190.0 Asia Brunei BRN
24 691141.0 Asia Bhutan BTN
25 1990876.0 Africa Botswana BWA
26 4511488.0 Africa Central African Rep. CAF
27 33487208.0 North America Canada CAN
28 7604467.0 Europe Switzerland CHE
29 16601707.0 South America Chile CHL
.. ... ... ... ...
147 7379339.0 Europe Serbia SRB
148 481267.0 South America Suriname SUR
149 5463046.0 Europe Slovakia SVK
150 2005692.0 Europe Slovenia SVN
151 9059651.0 Europe Sweden SWE
152 1123913.0 Africa Swaziland SWZ
153 20178485.0 Asia Syria SYR
154 10329208.0 Africa Chad TCD
155 6019877.0 Africa Togo TGO
156 65905410.0 Asia Thailand THA
157 7349145.0 Asia Tajikistan TJK
158 4884887.0 Asia Turkmenistan TKM
159 1131612.0 Asia Timor-Leste TLS
160 1310000.0 North America Trinidad and Tobago TTO
161 10486339.0 Africa Tunisia TUN
162 76805524.0 Asia Turkey TUR
163 22974347.0 Asia Taiwan TWN
164 41048532.0 Africa Tanzania TZA
165 32369558.0 Africa Uganda UGA
166 45700395.0 Europe Ukraine UKR
167 3494382.0 South America Uruguay URY
168 313973000.0 North America United States USA
169 27606007.0 Asia Uzbekistan UZB
170 26814843.0 South America Venezuela VEN
171 86967524.0 Asia Vietnam VNM
172 218519.0 Oceania Vanuatu VUT
173 23822783.0 Asia Yemen YEM
174 49052489.0 Africa South Africa ZAF
175 11862740.0 Africa Zambia ZMB
176 12619600.0 Africa Zimbabwe ZWE
gdp_md_est \
0 22270.0
1 110300.0
2 21810.0
3 184300.0
4 573900.0
5 18770.0
6 760.4
7 16.0
8 800200.0
9 329500.0
10 77610.0
11 3102.0
12 389300.0
13 12830.0
14 17820.0
15 224000.0
16 93750.0
17 9093.0
18 29700.0
19 114100.0
20 2536.0
21 43270.0
22 1993000.0
23 20250.0
24 3524.0
25 27060.0
26 3198.0
27 1300000.0
28 316700.0
29 244500.0
.. ...
147 80340.0
148 4254.0
149 119500.0
150 59340.0
151 344300.0
152 5702.0
153 98830.0
154 15860.0
155 5118.0
156 547400.0
157 13160.0
158 29780.0
159 2520.0
160 29010.0
161 81710.0
162 902700.0
163 712000.0
164 54250.0
165 39380.0
166 339800.0
167 43160.0
168 15094000.0
169 71670.0
170 357400.0
171 241700.0
172 988.5
173 55280.0
174 491000.0
175 17500.0
176 9323.0
0 POLYGON ((61.21081709172574 35.65007233330923, 62.23065148300589 35.270663967422

1 (POLYGON ((16.32652835456705 -5.877470391466218, 16.57317996589614 -6.6226445451
2
3
4 (POLYGON ((-65.50000000000003 -55.19999999999996, -66.45000000000002 -55.2500000
5
6 (POLYGON ((-59.57209469261153 -80.0401787250963, -59.86584937197472 -80.54965667
7
8 (POLYGON ((145.3979781434948 -40.79254851660589, 146.3641207216237 -41.137695407
9 POLYGON ((16.97966678230404 48.12349701597631, 16.90375410326726 47.714865627628
10 (POLYGON ((45.0019873390568 39.7400035670496, 45.29814497252144 39.4717512070224
11
12
13 POLYGON ((2.
14 POLYGON ((-2.827496303712707 9.642460842319778, -3.511898972986273 9.90032623945
15 POLYGON ((92.67272098182556 22.04123891854125, 92.65225711463799 21.324047552978
16 POLYGON ((22.65714969248299 44.23492300066128, 22.94483239105185 43.823785305347
17
18
19 POLYGON ((23.48412763844985 53.91249766704114, 24.45068362803704 53.905702216194
20
21 POLYGON ((-62.84646847192156 -22.03498544686945, -63.98683814152248 -21.99364430
22 POLYGON ((-57.62513342958296 -30.21629485445426, -56.29089962423908 -28.85276051
23
24
25 POLYGON ((25.64916344575016 -18.53602589281899, 25.85039147309473 -18.7144129370
26 POLYGON ((15.27946048346911 7.421924546737969, 16.10623172370677 7.4970879175065
27 (POLYGON ((-63.66449999999998 46.55000999999999, -62.93930000000003 46.415870000
28
29 (POLYGON ((-68.63401022758316 -52.63637045887437, -68.63334999999989 -54.8694999
..
147 POLYGON ((20.87431277841341 45.41637543393432, 21.48352623870221 45.181170152357
148 POLYGON ((-57.14743648947689 5.973149929219161, -55.9493184067898 5.7
149 POLYGON ((18.85314415861362 49.49622976337764, 18.90957482267632 49.435845852244
150
151 POLYGON ((22.18317345550193 65.72374054632017, 21.21351687997722 65.026005357515
152
153 POLYGON ((38.79234052913608 33.37868642835222, 36.83406212743554 32.31293752698
154 POLYGON ((14.4957873877629 12.85939626713736, 14.59578128424761 13.3304269474778
155
156 POLYGON ((102.5849324890267 12.18659495691328, 101.68715783082 12.64574005782657
157 POLYGON ((71.01419803252017 40.24436554621823, 70.64801883329997 39.935753892571
158 POLYGON ((61.21081709172574 35.65007233330923, 61.12307050969414 36.491597194966
159
160
161 POLYGON ((9.482139926805274 30.30755605724619, 9.055602654668149 32.102691962201
162 (POLYGON ((36.91312706884216 41.33535838476431, 38.34766482926452 40.94858612727
163
164 POLYGON ((33.9037111971046 -0.9499999999999886, 34.07261999999997 -1.05981999999
165 POLYGON ((31.86617000000007 -1.027359999999931, 30.76986000000011 -1.01
166 POLYGON ((31.78599816257159 52.10167796488545, 32.15941206231267 52.061266994833
167
168 (POLYGON ((-155.54211 19.08348000000001, -155.68817 18.91619000000003, -155.9366
169 POLYGON ((66.51860680528867 37.36278432875879, 66.54615034370022 37.974684963526
170 POLYGON ((-71.3315836249503 11.77628408451581, -71.36000566271082 11.53999359786
171 POLYGON ((108.0501802917829 21.55237986906012, 106.7150679870901 20.696850694252
172
173 POLYGON ((53.10857262554751 16.65105113368895, 52.38520592632588 16.382411200419
174 POLYGON ((31.52100141777888 -29.25738697684626, 31.325561150851 -29.401977634398
175 POLYGON ((32.75937544122132 -9.23059905358906, 33.2313879737753 -9.6767216935648
176 POLYGON ((31.19140913262129 -22.2515096981724, 30.65986535006709 -22.15156747811
[177 rows x 6 columns]
1.1 Introduction
Business Context. Air pollution is a serious issue facing the global population. The abundance
of air pollutants not only contributes to global warming but also causes serious health problems.
There have been numerous efforts to protect and improve air quality across most nations.
However, it seems that we are making very little progress. One of the main causes is that the
majority of air pollutants come from the burning of fossil fuels such as coal. Big industries and
several other economic and political factors have slowed the transition to renewable energy by
promoting the use of fossil fuels. Nevertheless, if we educate the general population and raise
awareness of this issue, we may be able to overcome this problem in the future.
For this case, you have been hired as a data science consultant for an important environmental
organization. In order to promote awareness of environmental and greenhouse gas issues, your
client is interested in a study of plausible impacts of air contamination on the health of the global
population. They have gathered some raw data provided by the World Health Organization, the
Institute for Health Metrics and Evaluation, and the World Bank Group. Your task is to conduct
data analysis, search for useful insights, and create visualizations that the client can use for
their campaigns and grant applications.
Analytical Context. You are given a folder named files containing raw data. This data contains
quite a large number of variables and is in a fairly disorganized state. In addition, one of the
datasets is poorly documented and split across several files. Your objective will
be to:
1. Extract and clean the relevant data. You will have to manipulate several datasets to obtain
useful information for the case.
2. Conduct Exploratory Data Analysis. You will have to create meaningful plots, formulate
meaningful hypotheses and study the relationship between various indicators related to air
pollution.
Additionally, the client has some broad questions they would like to answer:
1. Are we making any progress in reducing the amount of emitted pollutants across the globe?
2. Which are the critical regions where we should start environmental campaigns?
3. Are we making any progress in the prevention of deaths related to air pollution?
4. Which demographic characteristics seem to correlate with the number of health-related
issues derived from air pollution?
1.2 Extracting and cleaning relevant data

Let’s take a look at the data provided by the client in the files folder. There, we see another
folder named WDI_csv with several CSV files corresponding to the World Bank’s primary World
Development Indicators. The client stated that this data may contain some useful information
relevant to our study, but they have not told us anything aside from that. Thus, we are on our
own in finding and extracting the relevant data for our study. This we will do next.
Let’s take a peek at the file WDIData.csv:
In [5]: WDI_data = pd.read_csv("./files/WDI_csv/WDIData.csv")

print(WDI_data.columns)
print(WDI_data.info())
WDI_data
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',

'1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
'1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
'1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
'1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
'1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
'2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
'2014', '2015', '2016', '2017', '2018', '2019', 'Unnamed: 64'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377256 entries, 0 to 377255
Data columns (total 65 columns):
Country Name 377256 non-null object
Country Code 377256 non-null object
Indicator Name 377256 non-null object
Indicator Code 377256 non-null object
1960 37395 non-null float64
Unnamed: 64 0 non-null float64
dtypes: float64(61), object(4)
memory usage: 187.1+ MB
None
Out[5]: Country Name Country Code \

0 Arab World ARB
1 Arab World ARB
2 Arab World ARB
3 Arab World ARB
4 Arab World ARB
5 Arab World ARB
6 Arab World ARB
7 Arab World ARB
8 Arab World ARB
9 Arab World ARB
10 Arab World ARB
11 Arab World ARB
12 Arab World ARB
13 Arab World ARB
14 Arab World ARB
15 Arab World ARB
16 Arab World ARB
17 Arab World ARB
18 Arab World ARB
19 Arab World ARB
20 Arab World ARB
21 Arab World ARB
22 Arab World ARB
23 Arab World ARB
24 Arab World ARB
25 Arab World ARB
26 Arab World ARB
27 Arab World ARB
28 Arab World ARB
29 Arab World ARB
... ... ...
377226 Zimbabwe ZWE
377227 Zimbabwe ZWE
377228 Zimbabwe ZWE
377229 Zimbabwe ZWE
377230 Zimbabwe ZWE
377231 Zimbabwe ZWE
377232 Zimbabwe ZWE
377233 Zimbabwe ZWE
377234 Zimbabwe ZWE
377235 Zimbabwe ZWE
377236 Zimbabwe ZWE
377237 Zimbabwe ZWE
377238 Zimbabwe ZWE
377239 Zimbabwe ZWE
377240 Zimbabwe ZWE
377241 Zimbabwe ZWE
377242 Zimbabwe ZWE
377243 Zimbabwe ZWE
377244 Zimbabwe ZWE
377245 Zimbabwe ZWE
377246 Zimbabwe ZWE
377247 Zimbabwe ZWE
377248 Zimbabwe ZWE
377249 Zimbabwe ZWE
377250 Zimbabwe ZWE
377251 Zimbabwe ZWE
377252 Zimbabwe ZWE
377253 Zimbabwe ZWE
377254 Zimbabwe ZWE
377255 Zimbabwe ZWE
Indicator Name \
0 2005 PPP conversion factor, GDP (LCU per inter...
1 2005 PPP conversion factor, private consumptio...
2 Access to clean fuels and technologies for coo...
3 Access to electricity (% of population)
4 Access to electricity, rural (% of rural popul...
5 Access to electricity, urban (% of urban popul...
6 Account ownership at a financial institution o...
15 Adequacy of social insurance programs (% of to...
16 Adequacy of social protection and labor progra...
17 Adequacy of social safety net programs (% of t...
18 Adequacy of unemployment benefits and ALMP (% ...
19 Adjusted net enrollment rate, primary (% of pr...
20 Adjusted net enrollment rate, primary, female ...
21 Adjusted net enrollment rate, primary, male (%...
22 Adjusted net national income (annual % growth)
23 Adjusted net national income (constant 2010 US$)
24 Adjusted net national income (current US$)
25 Adjusted net national income per capita (annua...
26 Adjusted net national income per capita (const...
27 Adjusted net national income per capita (curre...
28 Adjusted net savings, excluding particulate em...
29 Adjusted net savings, excluding particulate em...
... ...
377226 Urban population
377227 Urban population (% of total population)
377228 Urban population growth (annual %)
377229 Urban population living in areas where elevati...
377230 Urban poverty gap at national poverty lines (%)
377231 Urban poverty headcount ratio at national pove...
377232 Use of IMF credit (DOD, current US$)
377233 Use of insecticide-treated bed nets (% of unde...
377234 Value lost due to electrical outages (% of sal...
377235 Vitamin A supplementation coverage rate (% of ...
377236 Vulnerable employment, female (% of female emp...
377237 Vulnerable employment, male (% of male employm...
377238 Vulnerable employment, total (% of total emplo...
377239 Wage and salaried workers, female (% of female...
377240 Wage and salaried workers, male (% of male emp...
377241 Wage and salaried workers, total (% of total e...
377242 Wanted fertility rate (births per woman)
377243 Water productivity, total (constant 2010 US$ G...
377244 Wholesale price index (2010 = 100)
377245 Women making their own informed decisions rega...
377246 Women participating in the three decisions (ow...
377247 Women who believe a husband is justified in be...
377253 Women who were first married by age 15 (% of w...
377254 Women who were first married by age 18 (% of w...
377255 Women's share of population ages 15+ living wi...
Indicator Code 1960 1961 1962 \

0 PA.NUS.PPP.05 NaN NaN NaN
1 PA.NUS.PRVT.PP.05 NaN NaN NaN
2 EG.CFT.ACCS.ZS NaN NaN NaN
3 EG.ELC.ACCS.ZS NaN NaN NaN
4 EG.ELC.ACCS.RU.ZS NaN NaN NaN
5 EG.ELC.ACCS.UR.ZS NaN NaN NaN
6 FX.OWN.TOTL.ZS NaN NaN NaN
7 FX.OWN.TOTL.FE.ZS NaN NaN NaN
8 FX.OWN.TOTL.MA.ZS NaN NaN NaN
9 FX.OWN.TOTL.OL.ZS NaN NaN NaN
10 FX.OWN.TOTL.40.ZS NaN NaN NaN
11 FX.OWN.TOTL.PL.ZS NaN NaN NaN
12 FX.OWN.TOTL.60.ZS NaN NaN NaN
13 FX.OWN.TOTL.SO.ZS NaN NaN NaN
14 FX.OWN.TOTL.YG.ZS NaN NaN NaN
15 per_si_allsi.adq_pop_tot NaN NaN NaN
16 per_allsp.adq_pop_tot NaN NaN NaN
17 per_sa_allsa.adq_pop_tot NaN NaN NaN
18 per_lm_alllm.adq_pop_tot NaN NaN NaN
19 SE.PRM.TENR NaN NaN NaN
20 SE.PRM.TENR.FE NaN NaN NaN
21 SE.PRM.TENR.MA NaN NaN NaN
22 NY.ADJ.NNTY.KD.ZG NaN NaN NaN
23 NY.ADJ.NNTY.KD NaN NaN NaN
24 NY.ADJ.NNTY.CD NaN NaN NaN
25 NY.ADJ.NNTY.PC.KD.ZG NaN NaN NaN
26 NY.ADJ.NNTY.PC.KD NaN NaN NaN
27 NY.ADJ.NNTY.PC.CD NaN NaN NaN
28 NY.ADJ.SVNX.GN.ZS NaN NaN NaN
29 NY.ADJ.SVNX.CD NaN NaN NaN
... ... ... ... ...
377226 SP.URB.TOTL 476164.000000 500664.000000 528408.00000
377227 SP.URB.TOTL.IN.ZS 12.608000 12.821000 13.08200
377228 SP.URB.GROW 4.977308 5.017288 5.39335
377229 EN.POP.EL5M.UR.ZS NaN NaN NaN
377230 SI.POV.URGP NaN NaN NaN
377231 SI.POV.URHC NaN NaN NaN
377232 DT.DOD.DIMF.CD NaN NaN NaN
377233 SH.MLR.NETS.ZS NaN NaN NaN
377234 IC.FRM.OUTG.ZS NaN NaN NaN
377235 SN.ITK.VITA.ZS NaN NaN NaN
377236 SL.EMP.VULN.FE.ZS NaN NaN NaN
377237 SL.EMP.VULN.MA.ZS NaN NaN NaN
377238 SL.EMP.VULN.ZS NaN NaN NaN
377239 SL.EMP.WORK.FE.ZS NaN NaN NaN
377240 SL.EMP.WORK.MA.ZS NaN NaN NaN
377241 SL.EMP.WORK.ZS NaN NaN NaN
377242 SP.DYN.WFRT NaN NaN NaN
377243 ER.GDP.FWTL.M3.KD NaN NaN NaN
377244 FP.WPI.TOTL NaN NaN NaN
377245 SG.DMK.SRCR.FN.ZS NaN NaN NaN
377246 SG.DMK.ALLD.FN.ZS NaN NaN NaN
377247 SG.VAW.REAS.ZS NaN NaN NaN
377248 SG.VAW.ARGU.ZS NaN NaN NaN
377249 SG.VAW.BURN.ZS NaN NaN NaN
377250 SG.VAW.GOES.ZS NaN NaN NaN
377251 SG.VAW.NEGL.ZS NaN NaN NaN
377252 SG.VAW.REFU.ZS NaN NaN NaN
377253 SP.M15.2024.FE.ZS NaN NaN NaN
377254 SP.M18.2024.FE.ZS NaN NaN NaN
377255 SH.DYN.AIDS.FE.ZS NaN NaN NaN
1963 1964 1965 ... 2011 \

0 NaN NaN NaN ... NaN
2 NaN NaN NaN ... 8.278329e+01
3 NaN NaN NaN ... 8.642827e+01
4 NaN NaN NaN ... 7.394210e+01
5 NaN NaN NaN ... 9.593924e+01
6 NaN NaN NaN ... 2.226054e+01
7 NaN NaN NaN ... 1.377582e+01
8 NaN NaN NaN ... 3.037767e+01
9 NaN NaN NaN ... 2.574129e+01
10 NaN NaN NaN ... 1.606778e+01
11 NaN NaN NaN ... 1.405166e+01
12 NaN NaN NaN ... 2.642109e+01
13 NaN NaN NaN ... 3.070720e+01
14 NaN NaN NaN ... 1.461090e+01
19 NaN NaN NaN ... 8.509152e+01
20 NaN NaN NaN ... 8.344254e+01
21 NaN NaN NaN ... 8.671280e+01
22 NaN NaN NaN ... 1.193135e+01
23 NaN NaN NaN ... 1.903652e+12
25 NaN NaN NaN ... 9.382820e+00
26 NaN NaN NaN ... 5.241930e+03
27 NaN NaN NaN ... 5.416624e+03
... ... ... ... ... ...
377226 567387.00000 609178.00000 653686.000000 ... 4.257058e+06
377227 13.57800 14.09200 14.620000 ... 3.301500e+01
377228 7.11729 7.10689 7.051661 ... 9.896455e-01
377232 NaN NaN NaN ... 5.270949e+08
377233 NaN NaN NaN ... 9.700000e+00
377234 NaN NaN NaN ... 8.800000e+00
377235 NaN NaN NaN ... 4.700000e+01
377236 NaN NaN NaN ... 7.552700e+01
377237 NaN NaN NaN ... 5.570100e+01
377238 NaN NaN NaN ... 6.536100e+01
377239 NaN NaN NaN ... 2.415800e+01
377240 NaN NaN NaN ... 4.363800e+01
377241 NaN NaN NaN ... 3.414700e+01
377242 NaN NaN NaN ... 3.500000e+00
377245 NaN NaN NaN ... 5.880000e+01
377246 NaN NaN NaN ... 7.450000e+01
377247 NaN NaN NaN ... 3.960000e+01
377248 NaN NaN NaN ... 1.560000e+01
377249 NaN NaN NaN ... 7.500000e+00
377250 NaN NaN NaN ... 2.230000e+01
377251 NaN NaN NaN ... 2.140000e+01
377252 NaN NaN NaN ... 1.690000e+01
377253 NaN NaN NaN ... 3.900000e+00
377254 NaN NaN NaN ... 3.050000e+01
377255 NaN NaN NaN ... 5.910000e+01
2012 2013 2014 2015 2016 \

0 NaN NaN NaN NaN NaN
2 8.312030e+01 8.353346e+01 8.389760e+01 8.417160e+01 8.451017e+01
3 8.707058e+01 8.817684e+01 8.734274e+01 8.913012e+01 8.967869e+01
4 7.524410e+01 7.716230e+01 7.553898e+01 7.874115e+01 7.966564e+01
5 9.596217e+01 9.635293e+01 9.599783e+01 9.664992e+01 9.683418e+01
6 NaN NaN 3.027713e+01 NaN NaN
19 8.520714e+01 8.421832e+01 8.425430e+01 8.403523e+01 8.453258e+01
20 8.411878e+01 8.321839e+01 8.334494e+01 8.318996e+01 8.382028e+01
21 8.630059e+01 8.522583e+01 8.518359e+01 8.489517e+01 8.525464e+01
22 6.032670e+00 3.090463e+00 1.504003e+00 -5.557763e+00 1.480371e-01
23 2.018494e+12 2.080874e+12 2.112171e+12 1.994781e+12 1.997734e+12
25 3.667670e+00 8.472754e-01 -6.422262e-01 -7.494294e+00 -1.834019e+00
26 5.434187e+03 5.480229e+03 5.445034e+03 5.036967e+03 4.944588e+03
27 5.905730e+03 5.951002e+03 6.035730e+03 5.445677e+03 5.294685e+03
... ... ... ... ... ...
377226 4.306222e+06 4.359425e+06 4.416215e+06 4.473868e+06 4.531255e+06
377227 3.283400e+01 3.265400e+01 3.250400e+01 3.238500e+01 3.229600e+01
377228 1.148264e+00 1.227921e+00 1.294283e+00 1.297036e+00 1.274558e+00
377232 5.201243e+08 5.193419e+08 4.867292e+08 4.637526e+08 4.551646e+08
377233 NaN NaN 2.680000e+01 9.000000e+00 NaN
377234 NaN NaN NaN NaN 6.100000e+00
377235 6.100000e+01 3.400000e+01 3.200000e+01 4.500000e+01 3.500000e+01
377236 7.524300e+01 7.521700e+01 7.530400e+01 7.540300e+01 7.547700e+01
377237 5.622400e+01 5.617200e+01 5.622500e+01 5.628400e+01 5.631400e+01
377238 6.550400e+01 6.546900e+01 6.554400e+01 6.562600e+01 6.567600e+01
377239 2.444600e+01 2.447500e+01 2.439000e+01 2.429100e+01 2.421800e+01
377240 4.311900e+01 4.318300e+01 4.313400e+01 4.307700e+01 4.304900e+01
377241 3.400800e+01 3.405000e+01 3.397900e+01 3.389900e+01 3.384900e+01
377242 NaN NaN NaN 3.600000e+00 NaN
377245 NaN NaN NaN 5.990000e+01 NaN
377246 NaN NaN NaN 7.210000e+01 NaN
377247 NaN NaN 3.740000e+01 3.870000e+01 NaN
377248 NaN NaN NaN 1.670000e+01 NaN
377249 NaN NaN NaN 8.100000e+00 NaN
377250 NaN NaN NaN 2.280000e+01 NaN
377251 NaN NaN NaN 2.140000e+01 NaN
377252 NaN NaN NaN 1.450000e+01 NaN
377253 NaN NaN NaN 3.700000e+00 NaN
377254 NaN NaN 3.350000e+01 3.240000e+01 NaN
377255 5.930000e+01 5.950000e+01 5.960000e+01 5.960000e+01 5.970000e+01
2017 2018 2019 Unnamed: 64

0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 9.027369e+01 NaN NaN NaN
4 8.074929e+01 NaN NaN NaN
5 9.700397e+01 NaN NaN NaN
6 3.716521e+01 NaN NaN NaN
7 2.563540e+01 NaN NaN NaN
8 4.832852e+01 NaN NaN NaN
9 4.254205e+01 NaN NaN NaN
10 2.772478e+01 NaN NaN NaN
11 2.645811e+01 NaN NaN NaN
12 4.344695e+01 NaN NaN NaN
13 4.866698e+01 NaN NaN NaN
14 2.095480e+01 NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 8.514375e+01 8.538422e+01 NaN NaN
20 8.399478e+01 8.425278e+01 NaN NaN
21 8.628836e+01 8.650601e+01 NaN NaN
22 2.554314e+00 NaN NaN NaN
23 2.048763e+12 NaN NaN NaN
24 NaN NaN NaN NaN
25 5.937190e-01 NaN NaN NaN
26 4.973945e+03 NaN NaN NaN
27 5.247986e+03 NaN NaN NaN
28 NaN NaN NaN NaN
29 NaN NaN NaN NaN
... ... ... ... ...
377226 4.589499e+06 4.650663e+06 NaN NaN
377227 3.223700e+01 3.220900e+01 NaN NaN
377228 1.277192e+00 1.323892e+00 NaN NaN
377229 NaN NaN NaN NaN
377232 4.821848e+08 4.708956e+08 NaN NaN
377235 4.300000e+01 NaN NaN NaN
377236 7.550500e+01 7.555400e+01 75.610003 NaN
377237 5.619900e+01 5.618400e+01 56.179000 NaN
377238 6.563600e+01 6.565200e+01 65.674000 NaN
377239 2.418900e+01 2.414100e+01 24.084000 NaN
377240 4.316200e+01 4.317700e+01 43.181999 NaN
377241 3.388800e+01 3.387200e+01 33.848999 NaN
377255 5.970000e+01 5.980000e+01 NaN NaN
The data seems to have a large number of indicators, with yearly values dating back to 1960.
There are also columns containing country names and codes. Notice that the first rows say
Arab World, which indicates that the data contains broad regional aggregates as well as
individual countries. We also notice that there are at least 100,000 entries with NaN values for
each year column; the 1960 column, for example, has only 37,395 non-null values out of 377,256 rows.
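To make the missing-value observation concrete, here is a minimal sketch that counts NaN entries per year column. It runs on a small made-up DataFrame (`toy_wdi`, a hypothetical stand-in with the same layout as WDIData.csv), so the numbers it produces say nothing about the real data; the same calls could be applied to WDI_data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for WDIData.csv: id columns, one column per year, and an
# empty trailing column. All values are made up for illustration.
toy_wdi = pd.DataFrame({
    "Country Name": ["Arab World", "Arab World", "Zimbabwe"],
    "Indicator Code": ["EG.ELC.ACCS.ZS", "EN.ATM.CO2E.KT", "EN.ATM.CO2E.KT"],
    "1960": [np.nan, 1.0, np.nan],
    "1961": [np.nan, 2.0, 3.0],
    "Unnamed: 64": [np.nan, np.nan, np.nan],
})

# Year columns are the ones whose name consists of digits only.
year_cols = [c for c in toy_wdi.columns if c.isdigit()]

# Number and share of missing values in each year column.
missing_count = toy_wdi[year_cols].isna().sum()
missing_share = toy_wdi[year_cols].isna().mean()

# Columns that hold no data at all (candidates for removal).
empty_cols = [c for c in toy_wdi.columns if toy_wdi[c].isna().all()]

print(missing_count)
print(empty_cols)
```

Selecting year columns by name rather than by position avoids accidentally including the empty trailing column.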
Since we are interested in environmental indicators, we must get rid of any rows not relevant
to our study. However, the number of indicators seems to be quite large and a manual inspection
seems impossible. Let’s load the file WDISeries.csv which seems to contain more information
about the indicators:
In [6]: WDI_ids = pd.read_csv("./files/WDI_csv/WDISeries.csv")

print(WDI_ids.columns)
WDI_ids.head()
Index(['Series Code', 'Topic', 'Indicator Name', 'Short definition',

'Long definition', 'Unit of measure', 'Periodicity', 'Base Period',
'Other notes', 'Aggregation method', 'Limitations and exceptions',
'Notes from original source', 'General comments', 'Source',
'Statistical concept and methodology', 'Development relevance',
'Related source links', 'Other web links', 'Related indicators',
'License Type', 'Unnamed: 20'],
dtype='object')
Out[6]: Series Code Topic \

0 AG.AGR.TRAC.NO Environment: Agricultural production
1 AG.CON.FERT.PT.ZS Environment: Agricultural production
2 AG.CON.FERT.ZS Environment: Agricultural production
3 AG.LND.AGRI.K2 Environment: Land use
4 AG.LND.AGRI.ZS Environment: Land use
Indicator Name Short definition \

0 Agricultural machinery, tractors NaN
1 Fertilizer consumption (% of fertilizer produc... NaN
2 Fertilizer consumption (kilograms per hectare ... NaN
3 Agricultural land (sq. km) NaN
4 Agricultural land (% of land area) NaN
Long definition Unit of measure \

0 Agricultural machinery refers to the number of... NaN
1 Fertilizer consumption measures the quantity o... NaN
2 Fertilizer consumption measures the quantity o... NaN
3 Agricultural land refers to the share of land ... NaN
4 Agricultural land refers to the share of land ... NaN
Periodicity Base Period Other notes Aggregation method ... \

0 Annual NaN NaN Sum ...
1 Annual NaN NaN Weighted average ...
3 Annual NaN NaN Sum ...
Notes from original source General comments \

0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
Source \
0 Food and Agriculture Organization, electronic ...
Statistical concept and methodology \

0 A tractor provides the power and traction to m...
1 Fertilizer consumption measures the quantity o...
2 Fertilizer consumption measures the quantity o...
3 Agricultural land constitutes only a part of a...
4 Agriculture is still a major sector in many ec...
Development relevance Related source links \
0 Agricultural land covers more than one-third o... NaN
1 Factors such as the green revolution, has led ... NaN
2 Factors such as the green revolution, has led ... NaN
Other web links Related indicators License Type Unnamed: 20

0 NaN NaN CC BY-4.0 NaN
Bingo! The WDI_ids DataFrame contains a column named Topic. Moreover, it seems that
Environment is listed as a key topic in the column.
1.2.1 Exercise 1:
Extract all the rows that have the topic key Environment in WDI_ids. Add to the resulting
DataFrame a new column named Subtopic which contains the corresponding subtopic of the
indicator. For example, the subtopic of Environment: Agricultural production is Agricultural
production. Which subtopics do you think are of interest to us?
Hint: Remember that you can apply string methods to a pandas Series through its .str
accessor (for example, Series.str.contains() or Series.str.split()).
Answer.
In [7]: WDI_copy=WDI_ids.copy()
WDI_ids_sub = WDI_copy.loc[(WDI_copy['Topic'].str.contains('Environment'))]
WDI_ids_sub[['Main Topic','Subtopic']] = WDI_ids_sub.Topic.str.split(':',expand = True,
#WDI_ids_sub #Print content of dataframe.
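As a hedged illustration of the idea behind this exercise, the sketch below filters a toy table for Environment topics and splits the Topic string into a main topic and a subtopic. The DataFrame `toy_ids`, the code `XX.TOY.CODE`, and the topic strings other than "Environment: Land use" are made up for the example; it is not the notebook's exact cell.

```python
import pandas as pd

# Toy stand-in for WDISeries.csv with only the two columns needed here.
# The last row (code and topic) is entirely hypothetical.
toy_ids = pd.DataFrame({
    "Series Code": ["AG.LND.AGRI.ZS", "EN.ATM.CO2E.KT", "XX.TOY.CODE"],
    "Topic": ["Environment: Land use", "Environment: Emissions", "Health: Toy topic"],
})

# Keep only Environment rows; .copy() so that adding columns below
# modifies an independent DataFrame rather than a view of toy_ids.
env = toy_ids[toy_ids["Topic"].str.startswith("Environment")].copy()

# Split once on the first colon, then strip the space left after it.
parts = env["Topic"].str.split(":", n=1, expand=True)
env["Main Topic"] = parts[0].str.strip()
env["Subtopic"] = parts[1].str.strip()

print(env[["Series Code", "Main Topic", "Subtopic"]])
```

Stripping whitespace matters because splitting "Environment: Land use" on ":" leaves a leading space in the second part, which would otherwise break exact comparisons on Subtopic.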
1.2.2 Exercise 2:
Use the results of Exercise 1 to create a new DataFrame with the history of all emissions indicators
for countries and major regions. Call this new DataFrame Emissions_df. How many emissions
indicators are in the study?
Answer.
In [8]: WDI_data_copy = WDI_data.copy() #copying first dataframe to enable merge
WDI_data_copy.head() #printing the data frame
Emissions_df = pd.merge(WDI_data_copy,WDI_ids_sub,how='inner', left_on='Indicator Code'
#Emissions_df.head()
In [9]: Emissions_df = Emissions_df.loc[(Emissions_df['Subtopic'].str.contains('Emissions'))]
#Emissions_df
In [10]: len(Emissions_df['Indicator Code'].unique())
Out[10]: 42
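The merge-then-filter pattern used in this exercise can be sketched on toy data as follows. This is only an illustration: the DataFrames `toy_data` and `toy_env` are made up, and joining on `Series Code` on the right-hand side is an assumption, since the merge call in the cell above is cut off.

```python
import pandas as pd

# Toy stand-in for the indicator history (one year column only).
toy_data = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ZWE", "ZWE"],
    "Indicator Code": ["EN.ATM.CO2E.KT", "AG.LND.AGRI.ZS",
                       "EN.ATM.CO2E.KT", "XX.TOY.CODE"],  # last code is made up
    "1960": [1.0, 2.0, 3.0, 4.0],
})

# Toy stand-in for the Environment indicators with their subtopics.
toy_env = pd.DataFrame({
    "Series Code": ["EN.ATM.CO2E.KT", "AG.LND.AGRI.ZS"],
    "Subtopic": ["Emissions", "Land use"],
})

# An inner join keeps only rows whose indicator appears in toy_env.
merged = toy_data.merge(toy_env, how="inner",
                        left_on="Indicator Code", right_on="Series Code")

# Keep emissions indicators and count the distinct codes.
emissions = merged[merged["Subtopic"].str.contains("Emissions")]
n_indicators = emissions["Indicator Code"].nunique()

print(n_indicators)
```

`nunique()` is equivalent to `len(... .unique())` for columns without missing values and reads a little more directly.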
1.2.3 Exercise 3:
The DataFrame Emissions_df has one column per year of observation. Data in this form is usually
referred to as data in wide format, as the number of columns is high. However, it might be easier
to query and filter the data if we had a single column containing the year in which each indicator
was calculated. This way, each observation will be represented by a single row. Use the pandas
function melt() to reshape the Emissions_df data into long format. The resulting DataFrame should
contain a pair of new columns named Year and Indicator Value:
Answer.
In [11]: #arr=WDI_data.columns[4:65]
In [12]: Emissions_melt=Emissions_df.melt(id_vars=['Indicator Code','Indicator Name_x', 'Countr

value_vars=WDI_data.columns[4:65],
var_name='Year',
value_name='Indicator Value')
#Emissions_melt
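For readers unfamiliar with melt(), here is a small self-contained sketch of the wide-to-long reshape on made-up data. The names `toy_wide`, `toy_long`, and `id_cols` are hypothetical, and the toy frame has fewer identifier columns than Emissions_df.

```python
import pandas as pd

# Wide format: one row per (country, indicator), one column per year.
toy_wide = pd.DataFrame({
    "Country Code": ["ARB", "ZWE"],
    "Indicator Code": ["EN.ATM.CO2E.KT", "EN.ATM.CO2E.KT"],
    "1960": [1.0, 3.0],
    "1961": [2.0, 4.0],
})

id_cols = ["Country Code", "Indicator Code"]
year_cols = [c for c in toy_wide.columns if c not in id_cols]

# Long format: one row per (country, indicator, year).
toy_long = toy_wide.melt(
    id_vars=id_cols,
    value_vars=year_cols,
    var_name="Year",
    value_name="Indicator Value",
)

print(toy_long)
```

Each of the two wide rows becomes two long rows, one per year column, so the result has four rows and the columns Country Code, Indicator Code, Year, and Indicator Value.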
1.2.4 Exercise 4:
The column Indicator Value of the new Emissions_df contains a bunch of NaN values.
Additionally, the Year column contains an Unnamed: 64 value. What procedure should we follow to
clean these missing values in our DataFrame? Proceed with your suggested cleaning process.
Answer. For Indicator Value, I would drop the rows with NaN values. The Unnamed: 64 column
had no non-null entries in the info() output of WDI_data, so every row whose Year is
Unnamed: 64 has a NaN Indicator Value and is removed by the same step.
In [13]: Emissions_melt.dropna(inplace=True)
#Emissions_melt
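One possible way to carry out this cleaning more explicitly is sketched below on toy data. It is an assumption-laden variant rather than the cell above: it restricts dropna() to the Indicator Value column, so missing values in other columns do not remove rows, and it converts Year to integers afterwards. The names `toy_long` and `clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy long-format data, including a row coming from the empty
# "Unnamed: 64" column. Values are made up.
toy_long = pd.DataFrame({
    "Country Code": ["ARB", "ARB", "ARB", "ZWE"],
    "Year": ["1960", "1961", "Unnamed: 64", "1960"],
    "Indicator Value": [1.0, np.nan, np.nan, 3.0],
})

# Drop rows without a value; only this column is checked for NaN.
clean = toy_long.dropna(subset=["Indicator Value"]).copy()

# The "Unnamed: 64" rows never hold a value, so they are gone as well.
leftover = clean["Year"].eq("Unnamed: 64").sum()

# With only real years left, Year can be stored as an integer.
clean["Year"] = clean["Year"].astype(int)

print(clean)
```

Converting Year to an integer makes later filtering and plotting by year behave numerically instead of as string comparison.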
1.2.5 Exercise 5:
Split the Emissions_df into two DataFrames, one containing only countries and the other
containing only regions. Name these Emissions_C_df and Emissions_R_df respectively.
Hint: You may want to inspect the file WDICountry.csv for this task. Region country codes
may be found by looking at null values of the Region column in WDICountry.
Answer.
In [14]: WDI_Country = pd.read_csv("./files/WDI_csv/WDICountry.csv")

print(WDI_Country.columns)
print(WDI_Country.info())
Index(['Country Code', 'Short Name', 'Table Name', 'Long Name', '2-alpha code',
'Currency Unit', 'Special Notes', 'Region', 'Income Group', 'WB-2 code',
'National accounts base year', 'National accounts reference year',
'SNA price valuation', 'Lending category', 'Other groups',
'System of National Accounts', 'Alternative conversion factor',
'PPP survey year', 'Balance of Payments Manual in use',
'External debt Reporting status', 'System of trade',
'Government Accounting concept', 'IMF data dissemination standard',
'Latest population census', 'Latest household survey',
'Source of most recent Income and expenditure data',
'Vital registration complete', 'Latest agricultural census',
'Latest industrial data', 'Latest trade data', 'Unnamed: 30'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 31 columns):
Country Code 263 non-null object
Short Name 263 non-null object
Table Name 263 non-null object
Long Name 263 non-null object
2-alpha code 261 non-null object
Currency Unit 217 non-null object
Special Notes 93 non-null object
Region 217 non-null object
Income Group 217 non-null object
WB-2 code 262 non-null object
National accounts base year 211 non-null object
National accounts reference year 71 non-null object
SNA price valuation 209 non-null object
Lending category 143 non-null object
Other groups 59 non-null object
System of National Accounts 206 non-null object
Alternative conversion factor 47 non-null object
PPP survey year 191 non-null object
Balance of Payments Manual in use 195 non-null object
External debt Reporting status 121 non-null object
System of trade 203 non-null object
Government Accounting concept 158 non-null object
IMF data dissemination standard 186 non-null object
Latest population census 217 non-null object
Latest household survey 152 non-null object
Source of most recent Income and expenditure data 168 non-null object
Vital registration complete 118 non-null object
Latest agricultural census 128 non-null object
Latest industrial data 147 non-null float64
Latest trade data 246 non-null float64
Unnamed: 30 0 non-null float64
dtypes: float64(3), object(28)
memory usage: 63.8+ KB
None
In [15]: # Collect the Country Code of every row whose Region is missing (per the hint, these are the regions).
Region_Array = WDI_Country.loc[WDI_Country["Region"].isna(), "Country Code"].tolist()
#Region_Array
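The country/region split requested in this exercise can be sketched on toy data like this. It is a minimal illustration under the hint's assumption that aggregates are exactly the rows with a null Region; `toy_country`, `toy_long`, `toy_regions`, and `toy_countries` are hypothetical names, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for WDICountry.csv: aggregates have no Region.
toy_country = pd.DataFrame({
    "Country Code": ["ARB", "WLD", "ZWE", "AFG"],
    "Region": [np.nan, np.nan, "Sub-Saharan Africa", "South Asia"],
})

# Toy stand-in for the cleaned long-format emissions data.
toy_long = pd.DataFrame({
    "Country Code": ["ARB", "ZWE", "WLD", "AFG"],
    "Indicator Value": [1.0, 2.0, 3.0, 4.0],
})

# Codes of the aggregate rows (null Region).
region_codes = toy_country.loc[toy_country["Region"].isna(), "Country Code"].tolist()

# A single boolean mask yields both halves of the split.
is_region = toy_long["Country Code"].isin(region_codes)
toy_regions = toy_long[is_region].copy()
toy_countries = toy_long[~is_region].copy()

print(toy_regions)
print(toy_countries)
```

Building both DataFrames from the same mask guarantees that every row lands in exactly one of them.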
In [16]: #for Region_Array in Emissions_R_df["Country Code"]:

Emissions_R_df = Emissions_melt[Emissions_melt["Country Code"].isin(Region_Array)].cop
Emissions_R_df
Emissions_R_df.rename(columns={"Country Name":"Region Name"})
Out[16]: Indicator Code Indicator Name_x \

1059 EN.ATM.CO2E.KD.GD CO2 emissions (kg per 2010 US$ of GDP)
1848 EN.ATM.CO2E.KT CO2 emissions (kt)
... ... ...
641272 EN.ATM.PM25.MC.ZS PM2.5 air pollution, population exposed to lev...
Region Name Country Code Year \

1059 Early-demographic dividend EAR 1960
1060 East Asia & Pacific EAS 1960
1061 East Asia & Pacific (excluding high income) EAP 1960
1062 East Asia & Pacific (IDA & IBRD countries) TEA 1960
1067 European Union EUU 1960
1069 Heavily indebted poor countries (HIPC) HPC 1960
1070 High income HIC 1960
1071 IBRD only IBD 1960
1072 IDA & IBRD total IBT 1960
1073 IDA blend IDB 1960
1074 IDA only IDX 1960
1075 IDA total IDA 1960
1076 Late-demographic dividend LTE 1960
1077 Latin America & Caribbean LCN 1960
1078 Latin America & Caribbean (excluding high income) LAC 1960
1079 Latin America & the Caribbean (IDA & IBRD coun... TLA 1960
1081 Low & middle income LMY 1960
1083 Lower middle income LMC 1960
1087 Middle income MIC 1960
1088 North America NAC 1960
1090 OECD members OED 1960
1093 Post-demographic dividend PST 1960
1096 South Asia SAS 1960
1097 South Asia (IDA & IBRD) TSA 1960
1098 Sub-Saharan Africa SSF 1960
1099 Sub-Saharan Africa (excluding high income) SSA 1960
1100 Sub-Saharan Africa (IDA & IBRD countries) TSS 1960
1101 Upper middle income UMC 1960
1102 World WLD 1960
1848 Arab World ARB 1960
... ... ... ...
641272 IDA & IBRD total IBT 2017
641273 IDA blend IDB 2017
641274 IDA only IDX 2017
641275 IDA total IDA 2017
641276 Late-demographic dividend LTE 2017
641277 Latin America & Caribbean LCN 2017
641278 Latin America & Caribbean (excluding high income) LAC 2017
641279 Latin America & the Caribbean (IDA & IBRD coun... TLA 2017
641280 Least developed countries: UN classification LDC 2017
641281 Low & middle income LMY 2017
641282 Low income LIC 2017
641283 Lower middle income LMC 2017
641284 Middle East & North Africa MEA 2017
641285 Middle East & North Africa (excluding high inc... MNA 2017
641286 Middle East & North Africa (IDA & IBRD countries) TMN 2017
641287 Middle income MIC 2017
641288 North America NAC 2017
641290 OECD members OED 2017
641291 Other small states OSS 2017
641292 Pacific island small states PSS 2017
641293 Post-demographic dividend PST 2017
641294 Pre-demographic dividend PRE 2017
641295 Small states SST 2017
641296 South Asia SAS 2017
641297 South Asia (IDA & IBRD) TSA 2017
641298 Sub-Saharan Africa SSF 2017
641299 Sub-Saharan Africa (excluding high income) SSA 2017
641300 Sub-Saharan Africa (IDA & IBRD countries) TSS 2017
641301 Upper middle income UMC 2017
641302 World WLD 2017
Indicator Value
1059 0.521402
1060 0.906151
1061 3.299058
1062 3.256992
1067 0.574664
1069 0.193260
1070 0.633055
1071 1.217518
1072 1.124521
1073 0.442355
1074 0.240913
1075 0.345023
1076 1.905257
1077 0.359918
1078 0.357913
1079 0.350371
1081 1.066781
1083 0.658654
1087 1.088138
1088 0.889531
1090 0.634181
1093 0.662381
1096 0.733359
1097 0.733359
1098 0.497527
1099 0.497722
1100 0.497527
1101 1.223373
1102 0.827240
1848 59535.396567
... ...
641272 98.043929
641273 99.955056
641274 99.937059
641275 99.943178
641276 95.016903
641277 87.214912
641278 87.793750
641279 87.423722
641280 99.999947
641281 98.103124
641282 100.000000
641283 99.469737
641284 100.000000
641285 100.000000
641286 100.000000
641287 97.872134
641288 3.022545
641290 59.442776
641291 91.544650
641292 93.629896
641293 52.803165
641294 100.000000
641295 93.213857
641296 99.320900
641297 99.320900
641298 100.000000
641299 100.000000
641300 100.000000
641301 96.065069
641302 91.295708
In [17]: Emissions_C_df = Emissions_melt[Emissions_melt["Country Code"].isin(Region_Array)==Fal

Emissions_C_df
Out[17]: Indicator Code \

1105 EN.ATM.CO2E.KD.GD
... ...
642281 EN.ATM.PM25.MC.T3.ZS
645448 EN.ATM.CO2E.PC
Indicator Name_x \
1105 CO2 emissions (kg per 2010 US$ of GDP)
1145 CO2 emissions (kg per 2010 US$ of GDP)
... ...
642281 PM2.5 pollution, population exposed to levels ...
645448 CO2 emissions (metric tons per capita)
Country Name Country Code Year Indicator Value

1105 Algeria DZA 1960 0.224490
1110 Argentina ARG 1960 0.422371
1113 Australia AUS 1960 0.442913
1114 Austria AUT 1960 0.335608
1116 Bahamas, The BHS 1960 0.211426
1121 Belgium BEL 1960 0.763531
1122 Belize BLZ 1960 0.452195
1123 Benin BEN 1960 0.127381
1124 Bermuda BMU 1960 0.127571
1126 Bolivia BOL 1960 0.273275
1129 Brazil BRA 1960 0.190172
1133 Burkina Faso BFA 1960 0.038148
1137 Cameroon CMR 1960 0.054810
1138 Canada CAN 1960 0.654772
1140 Central African Republic CAF 1960 0.097822
1141 Chad TCD 1960 0.026319
1143 Chile CHL 1960 0.459254
1144 China CHN 1960 6.086700
1145 Colombia COL 1960 0.437281
1147 Congo, Dem. Rep. COD 1960 0.146524
1148 Congo, Rep. COG 1960 0.150811
1149 Costa Rica CRI 1960 0.126841
1150 Cote d'Ivoire CIV 1960 0.107714
1156 Denmark DNK 1960 0.316624
1159 Dominican Republic DOM 1960 0.238808
1160 Ecuador ECU 1960 0.173455
1161 Egypt, Arab Rep. EGY 1960 1.042268
1169 Fiji FJI 1960 0.278461
1170 Finland FIN 1960 0.279969
1171 France FRA 1960 0.456487
... ... ... ... ...
642281 Suriname SUR 2017 99.665666
642282 Sweden SWE 2017 0.000000
642283 Switzerland CHE 2017 1.377046
642284 Syrian Arab Republic SYR 2017 100.000000
642285 Tajikistan TJK 2017 100.000000
642286 Tanzania TZA 2017 100.000000
642287 Thailand THA 2017 99.594649
642288 Timor-Leste TLS 2017 96.262700
642289 Togo TGO 2017 100.000000
642290 Tonga TON 2017 0.000000
642291 Trinidad and Tobago TTO 2017 100.000000
642292 Tunisia TUN 2017 100.000000
642293 Turkey TUR 2017 99.999170
642294 Turkmenistan TKM 2017 99.960505
642297 Uganda UGA 2017 100.000000
642298 Ukraine UKR 2017 97.898540
642299 United Arab Emirates ARE 2017 100.000000
642300 United Kingdom GBR 2017 1.214196
642301 United States USA 2017 0.277273
642302 Uruguay URY 2017 0.000000
642303 Uzbekistan UZB 2017 97.681436
642304 Vanuatu VUT 2017 0.000000
642305 Venezuela, RB VEN 2017 72.091015
642306 Vietnam VNM 2017 99.234229
642307 Virgin Islands (U.S.) VIR 2017 10.000000
642308 West Bank and Gaza PSE 2017 99.999998
642309 Yemen, Rep. YEM 2017 100.000000
642310 Zambia ZMB 2017 100.000000
642311 Zimbabwe ZWE 2017 100.000000
645448 Sudan SDN 2018 0.000000
1.3 Finalizing the cleaning for our study

Our data has improved a lot by now. However, since the number of indicators is still quite large,
let us focus our study on the following indicators for now:
• Total greenhouse gas emissions (kt of CO2 equivalent), EN.ATM.GHGT.KT.CE: Total
greenhouse gas emissions include CO2, methane, and nitrous oxide, among other pollutant
gases. Measured in kilotons.
• CO2 emissions (kt), EN.ATM.CO2E.KT: Carbon dioxide emissions are those stemming
from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide
produced during consumption of solid, liquid, and gas fuels and gas flaring.
• Methane emissions (kt of CO2 equivalent), EN.ATM.METH.KT.CE: Methane emissions
are those stemming from human activities such as agriculture and from industrial methane
production.
• Nitrous oxide emissions (kt of CO2 equivalent), EN.ATM.NOXE.KT.CE: Nitrous oxide
emissions are emissions from agricultural biomass burning, industrial activities, and
livestock management.
• Other greenhouse gas emissions, HFC, PFC and SF6 (kt of CO2 equivalent),
EN.ATM.GHGO.KT.CE: Other pollutant gases.
• PM2.5 air pollution, mean annual exposure (micrograms per cubic meter),
EN.ATM.PM25.MC.M3: Population-weighted exposure to ambient PM2.5 pollution is
defined as the average level of exposure of a nation’s population to concentrations of
suspended particles measuring less than 2.5 microns in aerodynamic diameter, which are
capable of penetrating deep into the respiratory tract and causing severe health damage.
Exposure is calculated by weighting mean annual concentrations of PM2.5 by population in
both urban and rural areas.
• PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of
total), EN.ATM.PM25.MC.ZS: Percent of population exposed to ambient concentrations of
PM2.5 that exceed the World Health Organization (WHO) guideline value.
1.3.1 Exercise 6:
For each of the emissions DataFrames, extract the rows corresponding to the above indicators of
interest. Replace the long names of the indicators by the short names Total, CO2, CH4, N2O, Other,
PM2.5, and PM2.5_WHO. (This will be helpful later when we need to label plots of our data.)
Answer.
In [18]: indicators_dict = {
"EN.ATM.GHGT.KT.CE": "Total",
"EN.ATM.CO2E.KT": "C02",
"EN.ATM.METH.KT.CE": "CH4",
"EN.ATM.NOXE.KT.CE": "N20",
"EN.ATM.GHGO.KT.CE": "Other",
"EN.ATM.PM25.MC.M3": "PM2.5",
"EN.ATM.PM25.MC.ZS": "PM2.5WHO"
}
l = [indicators_dict['EN.ATM.PM25.MC.M3']]
In [19]: #Extract desired codes

emit_R_df = Emissions_R_df[
Emissions_R_df["Indicator Code"].isin(indicators_dict.keys())
].copy()
#Clean up names by shortening description

emit_R_df["Indicator Code"] = emit_R_df["Indicator Code"].apply(
lambda x: indicators_dict[x])
#emit_R_df
#Extract desired codes

emit_C_df = Emissions_C_df[
Emissions_C_df["Indicator Code"].isin(indicators_dict.keys())
].copy()
#Clean up names by shortening description

emit_C_df["Indicator Code"] = emit_C_df["Indicator Code"].apply(
lambda x: indicators_dict[x])
#emit_C_df
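The filter-then-relabel step can also be written with `Series.map`, which takes the dictionary directly. The sketch below uses a two-entry mapping and a made-up frame (`short_names` and `toy` are hypothetical names, and the values are not WDI data).

```python
import pandas as pd

# Hypothetical subset of the code-to-label mapping, for illustration only.
short_names = {"EN.ATM.CO2E.KT": "CO2", "EN.ATM.METH.KT.CE": "CH4"}

toy = pd.DataFrame({
    "Indicator Code": ["EN.ATM.CO2E.KT", "EN.ATM.CO2E.PC", "EN.ATM.METH.KT.CE"],
    "Indicator Value": [100.0, 1.5, 40.0],
})

# Keep only the rows whose code is in the mapping, then relabel them.
subset = toy[toy["Indicator Code"].isin(list(short_names))].copy()
subset["Indicator Code"] = subset["Indicator Code"].map(short_names)
print(subset)
```

Filtering first matters: `map` would turn any code missing from the dictionary into NaN rather than raising an error.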
In [20]: ePMc = emit_C_df[emit_C_df['Indicator Code'].isin(l)].copy()

#ePMc
A dataframe of the indicators by year and country with PM2.5 (mean annual exposure) isolated.
1.4 Where shall the client start environmental campaigns?

Now the DataFrames Emissions_C_df and Emissions_R_df seem to be in good shape. Let’s
proceed to conduct some exploratory data analysis so that we can make recommendations to our
client.
1.4.1 Exercise 7:
Let’s first calculate some basic information about the main indicators across the globe.
7.1 Compute some basic statistics of the amount of kt of emissions for each of the four main
pollutants (CO2, CH4, N2O, Others) over the years. Use the Emissions_C_df data frame. What
trends do you see?
Answer.
In [21]: C02 = ["C02"]

CH4 = ["CH4"]
N20 = ["N20"]
Other = ["Other"]
Pollutants = [C02[0],CH4[0],N20[0],Other[0]]
2 CO2
In [22]: mainCC02 = emit_C_df[emit_C_df["Indicator Code"].isin(C02)].copy()
cc02=mainCC02["Indicator Value"].describe()
#print (np.std(mainC02["ktEmit"]))
#print (np.mean(mainC02["ktEmit"]))
#print (np.var(mainC02["ktEmit"]))
cc02
Out[22]: count 9.856000e+03

mean 1.004811e+05
std 4.950942e+05
min -8.067400e+01
25% 5.573840e+02
50% 4.275722e+03
75% 4.008581e+04
max 1.029193e+07
Name: Indicator Value, dtype: float64
3 CH4
In [23]: mainCCH4 = emit_C_df[emit_C_df["Indicator Code"].isin(CH4)].copy()
cch4=mainCCH4["Indicator Value"].describe()
#print (np.std(mainCH4["ktEmit"]))
#print (np.mean(mainCH4["ktEmit"]))
#print (np.var(mainCH4["ktEmit"]))
cch4
Out[23]: count 8.736000e+03

mean 3.190019e+04
std 1.049856e+05
min 0.000000e+00
25% 8.806213e+02
50% 5.457505e+03
75% 1.932534e+04
max 1.752290e+06
4 N2O
In [24]: mainN20 = emit_C_df[emit_C_df["Indicator Code"].isin(N20)].copy()
n20=mainN20["Indicator Value"].describe()
#print (np.std(mainN20["ktEmit"]))
#print (np.mean(mainN20["ktEmit"]))
#print (np.var(mainN20["ktEmit"]))
n20
Out[24]: count 8779.000000

mean 13575.872976
std 41248.850927
min 0.000000
25% 291.106585
50% 2499.294400
75% 8913.466837
max 587166.365500
5 Other
In [25]: mainOther = emit_C_df[emit_C_df["Indicator Code"].isin(Other)].copy()
other=mainOther["Indicator Value"].describe()
#print (np.std(mainOthers["ktEmit"]))
#print (np.mean(mainOthers["ktEmit"]))
#print (np.var(mainOthers["ktEmit"]))
other
Out[25]: count 7.971000e+03

mean 3.082499e+04
std 1.321496e+05
min -3.262726e+05
25% 7.548464e+00
50% 8.432500e+02
75% 1.075486e+04
max 3.484920e+06
Consider plotting mean over time.
In [26]: pollutantsreg = emit_C_df[emit_C_df["Indicator Code"].isin(Pollutants)].copy()

#pollreg=pollutantsreg["Indicator Value"].mean()
#pollutantsreg
# Select the value column before averaging so the text columns are never aggregated
avpoll=pollutantsreg.groupby(['Year', 'Indicator Code'])['Indicator Value'].mean()
avpoll
Out[26]: Year Indicator Code

1960 C02 43864.241159
1961 C02 43016.467250
1962 C02 43651.539390
1963 C02 45599.499871
1964 C02 46178.029919
1965 C02 48562.081000
1966 C02 51025.211733
1967 C02 52682.012441
1968 C02 55899.497460
1969 C02 60002.642696
1970 C02 65238.044712
CH4 26057.280609
N20 10789.454327
Other 23037.898904
1971 C02 67113.992994
CH4 25265.323492
N20 10116.866672
Other 17694.695700
1972 C02 69707.129608
CH4 26324.434722
N20 10877.512100
Other 21556.652755
1973 C02 73630.223169
CH4 26622.476042
N20 11093.793899
Other 21009.787311
1974 C02 72886.926687
CH4 26411.137428
N20 10850.954334
Other 18258.766077
...
2006 C02 142007.697396
CH4 37435.649878
N20 15097.413940
Other 40121.414471
2007 C02 145367.159808
CH4 37827.832122
N20 15881.832072
Other 39282.106526
2008 C02 149633.470443
CH4 37410.279986
N20 14721.050659
Other 34314.859442
2009 C02 148186.017030
CH4 38210.173914
N20 14944.935049
Other 32439.979269
2010 C02 155782.613970
CH4 37883.007264
N20 14832.660386
Other 38139.160139
2011 C02 161941.999808
CH4 38957.611430
N20 15155.798583
Other 44330.506322
2012 C02 162632.874078
CH4 39385.301627
N20 15300.954568
Other 46333.117526
2013 C02 163074.533966
2014 C02 165114.116327
Name: Indicator Value, Length: 184, dtype: float64
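To make the shape of this result concrete, here is a small sketch of the same group-and-average step on made-up numbers, followed by the `unstack` that turns it into one column per pollutant (the form a line plot needs). The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up observations: two countries report CO2, one reports CH4, per year.
toy = pd.DataFrame({
    "Year": ["1970", "1970", "1970", "1971", "1971", "1971"],
    "Indicator Code": ["CO2", "CO2", "CH4", "CO2", "CO2", "CH4"],
    "Indicator Value": [10.0, 30.0, 4.0, 20.0, 40.0, 6.0],
})

# Mean across countries for every (year, pollutant) pair.
yearly_mean = toy.groupby(["Year", "Indicator Code"])["Indicator Value"].mean()

# One row per year, one column per pollutant.
wide = yearly_mean.unstack("Indicator Code")
print(wide)
```

Years in which a pollutant has no data show up as NaN in the wide table, which is why the output above lists only one pollutant for some years.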
In [27]: av=avpoll.unstack('Indicator Code').plot()

#avpoll=sns.lineplot(data=pollutantsreg, x='Year', y='Indicator Value')
av.set_xlabel('Year', fontsize=20)
av.set_ylabel('Indicator Value', fontsize=20)
# One tick per decade; positions 0..54 on the x axis correspond to the years 1960..2014
plt.xticks(np.arange(0,55,10),['1960','1970','1980','1990','2000','2010'], fontsize=10, rotation=45)
plt.yticks(np.arange(0,170000,50000), fontsize=12)
av
Out[27]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7b8c7f17b8>
7.2 What can you say about the distribution of emissions around the globe over the years? What
information can you extract from the tails of these distributions over the years?
Answer.
In [28]: #emit_C_df
mainpoll_graph_df1 = emit_R_df[emit_R_df["Indicator Code"].isin(Pollutants)].copy()
mainpoll_graph_df2 = mainpoll_graph_df1[mainpoll_graph_df1["Country Code"].isin(glRegions)].copy()
mainpoll_graph_df2
In [29]: LCN=["LCN"]
SAS=["SAS"]
SSF=["SSF"]
ECS=["ECS"]
MEA=["MEA"]
EAS=["EAS"]
NAC=["NAC"]
glRegions = [LCN[0],SAS[0],SSF[0],ECS[0],MEA[0],EAS[0],NAC[0]]
6 CO2
In [30]: graphRC02i = emit_R_df[emit_R_df["Indicator Code"].isin(C02)].copy()
graphRC02 = graphRC02i[graphRC02i["Country Code"].isin(glRegions)].copy()
grC02=sns.lineplot(data=graphRC02, x='Year', y='Indicator Value',hue="Country Code")
grC02.set_xticks(grC02.get_xticks()[0::10])
grC02.set_xlabel(grC02.get_xlabel(), fontsize=24);
grC02.set_ylabel(grC02.get_ylabel(), fontsize=24);
plt.show()
7 CH4
In [31]: graphRCH4i = emit_R_df[emit_R_df["Indicator Code"].isin(CH4)].copy()
graphRCH4 = graphRCH4i[graphRCH4i["Country Code"].isin(glRegions)].copy()
grCH4=sns.lineplot(data=graphRCH4, x='Year', y='Indicator Value',hue="Country Code")
grCH4.set_xticks(grCH4.get_xticks()[0::10])
grCH4.set_xlabel(grCH4.get_xlabel(), fontsize=24);
grCH4.set_ylabel(grCH4.get_ylabel(), fontsize=24);
plt.show()
8 N2O
In [32]: graphRN20i = emit_R_df[emit_R_df["Indicator Code"].isin(N20)].copy()
graphRN20 = graphRN20i[graphRN20i["Country Code"].isin(glRegions)].copy()
grN20=sns.lineplot(data=graphRN20, x='Year', y='Indicator Value',hue="Country Code")
grN20.set_xticks(grN20.get_xticks()[0::10])
grN20.set_xlabel(grN20.get_xlabel(), fontsize=24);
grN20.set_ylabel(grN20.get_ylabel(), fontsize=24);
plt.show()
9 Other
In [33]: graphROtheri = emit_R_df[emit_R_df["Indicator Code"].isin(Other)].copy()
graphROther = graphROtheri[graphROtheri["Country Code"].isin(glRegions)].copy()
grOther=sns.lineplot(data=graphROther, x='Year', y='Indicator Value',hue="Country Code
grOther.set_xticks(grOther.get_xticks()[0::10])
grOther.set_xlabel(grOther.get_xlabel(), fontsize=24);
grOther.set_ylabel(grOther.get_ylabel(), fontsize=24);
plt.show()
In [34]: #leaving this here because it's pretty and I may want to use it later.
fig = plt.figure()
ax = plt.axes(projection='3d')
7.3 Compute a plot showing the behavior of each of the four main air pollutants for each of the
main global regions in the Emissions_R_df data frame. The main regions are 'Latin America &
Caribbean', 'South Asia', 'Sub-Saharan Africa', 'Europe & Central Asia', 'Middle
East & North Africa', 'East Asia & Pacific' and 'North America'. What conclusions can
you make?
Answer.
In [35]: mainR=emit_R_df.melt(id_vars=['Indicator Code', 'Country Code'],
value_vars=WDI_data.columns[4:65],
var_name='Year',
value_name='ktEmit')
#mainR
10 Latin America & Caribbean

In [36]: mainRLCN = emit_R_df[emit_R_df["Country Code"].isin(LCN)].copy()
mainRLCNemit=mainRLCN[mainRLCN["Indicator Code"].isin(Pollutants)].copy()
mainRLCNemit
figLCN = plt.subplots(figsize=(30, 15))
lcn=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainRLCNem
lcn.set_xticklabels(lcn.get_xticklabels(), rotation=45, fontsize=16);
lcn.set_xlabel(lcn.get_xlabel(), fontsize=24);
lcn.set_ylabel(lcn.get_ylabel(), fontsize=24);
lcn.set_ylim(0,15000000)
plt.gcf().axes[0].yaxis.get_major_formatter().set_scientific(False)
leglcn = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
#plt.xtick('xtick', labelsize=22) # fontsize of the tick labels
#plt.rc('ytick', labelsize=18) # fontsize of the tick labels
#plt.rcdefaults()
11 South Asia
In [37]: mainRSAS = emit_R_df[emit_R_df["Country Code"].isin(SAS)].copy()
mainRSASemit=mainRSAS[mainRSAS["Indicator Code"].isin(Pollutants)].copy()
mainRSASemit
figSAS = plt.subplots(figsize=(30, 15))
sas=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainRSASem
sas.set_xticklabels(sas.get_xticklabels(), rotation=45, fontsize=16);
sas.set_xlabel(sas.get_xlabel(), fontsize=24);
sas.set_ylabel(sas.get_ylabel(), fontsize=24);
sas.set_ylim(0,15000000)
legsas = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
12 Sub-Saharan Africa
In [38]: mainRSSF = emit_R_df[emit_R_df["Country Code"].isin(SSF)].copy()
mainRSSFemit=mainRSSF[mainRSSF["Indicator Code"].isin(Pollutants)].copy()
mainRSSFemit
figSSF = plt.subplots(figsize=(30, 15))
ssf=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainRSSFem
ssf.set_xticklabels(ssf.get_xticklabels(), rotation=45, fontsize=16);
ssf.set_xlabel(ssf.get_xlabel(), fontsize=24);
ssf.set_ylabel(ssf.get_ylabel(), fontsize=24);
ssf.set_ylim(0,15000000)
legssf = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
13 Europe & Central Asia
In [39]: mainRECS = emit_R_df[emit_R_df["Country Code"].isin(ECS)].copy()
mainRECSemit=mainRECS[mainRECS["Indicator Code"].isin(Pollutants)].copy()
mainRECSemit
figECS = plt.subplots(figsize=(30, 15))
ecs=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainRECSem
ecs.set_xticklabels(ecs.get_xticklabels(), rotation=45, fontsize=16);
ecs.set_xlabel(ecs.get_xlabel(), fontsize=24);
ecs.set_ylabel(ecs.get_ylabel(), fontsize=24);
ecs.set_ylim(0,15000000)
legecs = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
14 Middle East & North Africa
In [40]: mainRMEA = emit_R_df[emit_R_df["Country Code"].isin(MEA)].copy()
mainRMEAemit=mainRMEA[mainRMEA["Indicator Code"].isin(Pollutants)].copy()
mainRMEAemit
figMEA = plt.subplots(figsize=(30, 15))
mea=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainRMEAem
mea.set_xticklabels(mea.get_xticklabels(), rotation=45, fontsize=16);
mea.set_xlabel(mea.get_xlabel(), fontsize=24);
mea.set_ylabel(mea.get_ylabel(), fontsize=24);
mea.set_ylim(0,15000000)
legmea = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
15 East Asia & Pacific
In [41]: mainREAS = emit_R_df[emit_R_df["Country Code"].isin(EAS)].copy()
mainREASemit=mainREAS[mainREAS["Indicator Code"].isin(Pollutants)].copy()
mainREASemit
figEAS = plt.subplots(figsize=(30, 15))
eas=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainREASem
eas.set_xticklabels(eas.get_xticklabels(), rotation=45, fontsize=16);
eas.set_ylim(0,15000000)
eas.set_xlabel(eas.get_xlabel(), fontsize=24);
eas.set_ylabel(eas.get_ylabel(), fontsize=24);
legeas = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
16 North America
In [42]: mainRNAC = emit_R_df[emit_R_df["Country Code"].isin(NAC)].copy()
mainRNACemit=mainRNAC[mainRNAC["Indicator Code"].isin(Pollutants)].copy()
mainRNACemit
figNAC = plt.subplots(figsize=(30, 15))
nac=sns.swarmplot(y="Indicator Value", x="Year", hue="Indicator Code", data=mainRNACem
nac.set_xticklabels(nac.get_xticklabels(), rotation=45, fontsize=16);
nac.set_xlabel(nac.get_xlabel(), fontsize=24);
nac.set_ylabel(nac.get_ylabel(), fontsize=24);
nac.set_ylim(0,15000000)
legnac = plt.legend(loc=9, ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
It seems that East Asia & Pacific is the region doing worst at controlling pollutant emissions.
Europe & Central Asia appears to have made some effort to reduce its emissions.
Surprisingly, this is not the case for North America and Sub-Saharan Africa, whose
levels have also been increasing over the years.
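Reading trends off swarm plots is subjective, so one way to back up statements like these is to compute the percent change between the first and last available year for each region. The sketch below does this on made-up values; the frame `toy`, the two years, and the numbers are assumptions for illustration only.

```python
import pandas as pd

# Made-up regional totals for two years (not WDI values).
toy = pd.DataFrame({
    "Country Code": ["EAS", "EAS", "ECS", "ECS"],
    "Year": [1990, 2012, 1990, 2012],
    "Indicator Value": [100.0, 250.0, 200.0, 150.0],
})

# One row per region, one column per year.
wide = toy.pivot(index="Country Code", columns="Year", values="Indicator Value")

# Percent change from the earliest to the latest year in the table.
first, last = wide.columns.min(), wide.columns.max()
pct_change = (wide[last] - wide[first]) / wide[first] * 100
print(pct_change.round(1))
```

A positive value means emissions grew over the window and a negative value means they fell, which gives a single comparable number per region.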
16.0.1 Exercise 8:
In Exercise 7 we discovered some interesting features of the distribution of the emissions over the
years. Let us explore these features in more detail.
8.1 Which are the top five countries that have been in the top 10 of CO2 emitters over the years?
Have any of these countries made efforts to reduce the amount of CO2 emissions over the last 10
years?
Answer.
In [43]: #mainCC02
In [44]: mainCC02.astype({'Year': 'int64'}).dtypes
Out[44]: Indicator Code object

Indicator Name_x object
Country Name object
Country Code object
Year int64
Indicator Value float64
dtype: object
In [45]: newCC02=mainCC02.groupby([mainCC02.Year]).apply(lambda grp: grp.nlargest(10, 'Indicato
In [46]: n = 5
CC02list=newCC02['Country Name'].value_counts()[:n].index.tolist()
CC02list
Out[46]: ['China', 'Japan', 'India', 'United States', 'Canada']
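The idea behind these cells is to take the largest emitters within each year and then count how often each country appears. A minimal sketch on made-up data follows; it uses `sort_values` plus `groupby(...).head(n)` as one possible way to get the top rows per year, and the names `toy`, `top_n`, and the values are hypothetical.

```python
import pandas as pd

# Made-up emissions for three countries over three years.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "Year": [2000] * 3 + [2001] * 3 + [2002] * 3,
    "Indicator Value": [9, 8, 1, 9, 2, 8, 9, 8, 1],
})

top_n = 2
# Sort once, then keep the top_n largest rows within each year.
top_per_year = (
    toy.sort_values("Indicator Value", ascending=False)
       .groupby("Year")
       .head(top_n)
)

# How many yearly top lists each country appears in, most frequent first.
appearances = top_per_year["Country Name"].value_counts()
most_frequent = appearances.index[:top_n].tolist()
print(appearances)
```

Note that the resulting list is ordered by number of appearances, not by the amount emitted.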
In [47]: n = 5
fiveCC02=newCC02[newCC02["Country Name"].isin(newCC02['Country Name'].value_counts()[:
#fiveCC02
In [48]: graphCC02=sns.lineplot(data=fiveCC02.tail(49), x='Year', y='Indicator Value', hue='Cou

graphCC02
Out[48]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7b89dbeb70>
17 Answer
The five countries that most frequently appeared in the top 10 emitters over the years were China, the
United States, Japan, India, and Canada. The United States decreased its CO2 emissions over
the 10-year window. Canada and Japan saw minimal change in their CO2 emissions during that
same window. India and China saw increases in their CO2 emissions during the window.
8.2 Are these five countries carrying the burden of most of the emissions emitted globally over the
years? Can we say that the rest of the world is making some effort to control its polluting
gas emissions over the years?
Answer.
What percent of total emissions are these 5 countries responsible for? What is the change in
emissions over time for the other countries? We need to create a separate dataframe excluding
these 5 countries.
In [49]: #EN.ATM.GHGT.KT.CE
#emit_C_df
tot=['Total']
Sum of all emissions of type “Total” for all countries and years.
In [50]: alltot=emit_C_df[emit_C_df["Indicator Code"].isin(tot)].copy()

allsum=alltot["Indicator Value"].sum()
allsum
Out[50]: 1597525708.0684905
In [51]: total5emit=alltot[alltot["Country Name"].isin(CC02list)].copy()

#total5emit
Sum of all emissions of type “Total” for the top 5 countries for all years.
In [52]: fivesum=total5emit["Indicator Value"].sum()

fivesum
Out[52]: 631799452.38203335
Percent of global emissions over the years for which the top 5 countries are responsible.
In [53]: burden=(fivesum/allsum)*100
print (str(burden)+"%")
39.5486250513%
In [54]: restworld=alltot[~alltot["Country Name"].isin(CC02list)].copy()

#restworld
Graph of total emissions over the years for all countries excluding the top 5.
In [55]: graphrest=sns.lineplot(data=restworld, x='Year', y='Indicator Value')

#.set_xticklabels(restworld.Year, rotation=45)
xaxis=graphrest.xaxis.get_major_ticks()
for i in range(len(xaxis)):
    if i%10!=0:
        xaxis[i].set_visible(False)
graphrest
Out[55]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7b89db0f28>
In [56]: print (alltot["Country Name"].nunique())

print (str(float((5/199)*100))+"%")
199
2.512562814070352%
18 Answer
A. At first glance, I would have argued that these countries are not responsible for most of the
emissions, because they account for only 39.5% of the total. However, considering
that over a third of total emissions comes from fewer than 3% of all countries, I
would definitely say that these five countries shoulder the greatest burden of the emissions.
B. I cannot say that the rest of the world is doing much better at controlling emissions, because
there has been a consistent overall upward trend in total polluting gas emissions over the years.
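The comparison in answer A can be made explicit by putting the two shares side by side. The sketch below reuses the sums printed in Out[50] and Out[52] and the country count from In [56]; the "concentration ratio" is my own illustrative summary, not a quantity defined in the exercise.

```python
# Values copied from the notebook outputs above.
five_sum = 631799452.38203335   # Out[52]: total emissions of the top 5 countries
all_sum = 1597525708.0684905    # Out[50]: total emissions of all countries
n_top, n_countries = 5, 199     # In [56]: number of distinct countries

emission_share = five_sum / all_sum * 100   # percent of emissions
country_share = n_top / n_countries * 100   # percent of countries

# How many times larger the emission share is than the country share.
concentration = emission_share / country_share

print(f"{emission_share:.1f}% of emissions from {country_share:.1f}% of countries")
print(f"concentration ratio: {concentration:.1f}x")
```

If emissions were spread evenly across countries, this ratio would be 1.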
18.1 The health impacts of air pollution

18.1.1 Exercise 9:
One of the main contributors to poor health from air pollution is particulate matter. In particular,
very small particles (those with a size less than 2.5 micrometres (µm)) can enter and affect the
respiratory system. The PM2.5 indicator measures the average level of exposure of a nation’s
population to concentrations of these small particles. The PM2.5_WHO indicator measures the
percentage of the population exposed to ambient concentrations of these particles that exceed the
guideline value set by the World Health Organization (WHO). In particular, countries with a higher
PM2.5_WHO indicator are more likely to suffer from bad health conditions.
9.1 The client would like to know if there is any relationship between the PM2.5_WHO indicator
and the level of income of the general population, as well as how this changes over time. What
plot(s) might be helpful to solve the client’s question? What conclusion can you draw from your
plot(s) to answer their question?
Hint: The DataFrame WDI_Country contains a column named Income Group.
Answer.
In [57]: pm2=['PM2.5WHO']
pm25_C_df = emit_C_df[emit_C_df['Indicator Code'].isin(pm2)].copy()
#pm25_C_df
In [58]: WDI_PM2 = WDI_Country.copy() #copying first dataframe to enable merge
#WDI_data_copy.head() #printing the data frame
#WDI_Country.set_index(['Country Code','Country Name'])['Year'].unstack()
#WDI_Country
PM25WHO = pd.merge(WDI_PM2, pm25_C_df, how='inner', on='Country Code')
#PM25WHO
In [59]: figPM25 = plt.subplots(figsize=(30, 15))
graphpm=sns.barplot(y="Indicator Value", x="Year", hue="Income Group", data=PM25WHO)
graphpm.set_xlabel(graphpm.get_xlabel(), fontsize=24);
graphpm.set_ylabel(graphpm.get_ylabel(), fontsize=24);
legpm = plt.legend(ncol=2, shadow=True, fancybox=True, fontsize='xx-large')
graphpm
Out[59]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7b89c6d710>
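A bar plot with one bar per year and income group is dense, so it can help to look at the underlying table as well. The sketch below shows the merge-then-average step on made-up data; `toy_country`, `toy_pm`, the country codes, and the values are hypothetical stand-ins for the country table and the PM2.5WHO frame.

```python
import pandas as pd

# Hypothetical country metadata and PM2.5WHO observations.
toy_country = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Income Group": ["High income", "Low income", "Low income"],
})
toy_pm = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "Year": [2010, 2010, 2010, 2017, 2017, 2017],
    "Indicator Value": [40.0, 100.0, 90.0, 10.0, 100.0, 96.0],
})

# Attach the income group to every observation.
merged = toy_pm.merge(toy_country, on="Country Code", how="inner")

# Mean share of population above the WHO guideline, per year and income group.
by_income = (
    merged.groupby(["Year", "Income Group"])["Indicator Value"]
          .mean()
          .unstack("Income Group")
)
print(by_income)
```

Each column of `by_income` can then be read down the years to see how one income group changes over time.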
9.2 What do you think are the causes behind the results in Exercise 9.1?
Answer.
Air quality in lower-income communities is likely to be worse because there is minimal greenery
to provide fresh air and oxygen. Public transportation is also likely to be used more heavily in
low-income areas. Lastly, high-income citizens tend to live in more spacious communities, which
lowers the density of pollutants relative to area.
18.1.2 Exercise 10:

Finally, our client is interested in investigating the relationship between high levels
of exposure to particulate matter and the health of the population. Coming up with additional data
for this task may be infeasible for the client, so they have asked us to search for relevant health
data in the WDIdata.csv file and work with that.
10.1 Which indicators present in the WDISeries.csv file might be useful to solve the client’s
question? Explain.
Note: Naming one or two indicators is more than enough for this question.
Answer.
In [60]: apmfe = ['SH.STA.AIRP.FE.P5']

apmma = ['SH.STA.AIRP.MA.P5']
apm = ['SH.STA.AIRP.P5']
apmTotal = [apmfe, apmma, apm]
In [61]: WID=WDI_ids.copy()
WDI_ids_health1 = WID[(WID['Long definition'].str.contains('pollution'))]
WDI_ids_health2 = WDI_ids_health1[(WDI_ids_health1['Topic'].str.contains('Health'))]
WDI_ids_health = WDI_ids_health2[WDI_ids_health2['Series Code'].isin(apm)].copy()
WDI_ids_health
Out[61]: Series Code Topic \

986 SH.STA.AIRP.P5 Health: Mortality
Indicator Name Short definition \

986 Mortality rate attributed to household and amb... NaN
Long definition Unit of measure \

986 Mortality rate attributed to household and amb... NaN
Periodicity Base Period Other notes Aggregation method ... \

Notes from original source General comments \

986 NaN NaN
Source \
986 World Health Organization, Global Health Obser...
Statistical concept and methodology \
986 NaN
Development relevance Related source links \

986 Air pollution is one of the biggest environmen... NaN
Other web links Related indicators License Type Unnamed: 20

In [62]: pd.options.display.max_colwidth=1000
print (str(WDI_ids_health['Long definition']))
#indname=[WDI_ids_health['Indicator Name']]
#[print(x) for x in indname]
#WDI_ids_sub[['Main Topic','Subtopic']] = WDI_ids_sub.Topic.str.split(':',expand = Tru
#WDI_ids_sub #Print content of dataframe.
986 Mortality rate attributed to household and ambient air pollution is the number of deaths
Name: Long definition, dtype: object
Answer.
1. Mortality rate attributed to household and ambient air pollution, age-standardized, female
(per 100,000 female population)
2. Mortality rate attributed to household and ambient air pollution, age-standardized, male
(per 100,000 male population)
3. Mortality rate attributed to household and ambient air pollution, age-standardized (per
100,000 population)
(will only graph #3)
10.2 Use the indicators provided in Exercise 10.1 to give valuable information to the client.
Answer.
In [63]: countries = WDI_Country['Country Code'].tolist()

#countries
In [64]: mortalityMerge1 = pd.merge(WDI_data, WDI_ids_health,how='inner',left_on='Indicator Cod

mortality_melt=mortalityMerge1.melt(id_vars=['Indicator Code','Indicator Name_x', 'Cou
value_vars=mortalityMerge1.columns[4:65],
var_name='Year',
value_name='Indicator Value')
mortality_melt.dropna(inplace=True)
mort_countries = mortality_melt[mortality_melt['Country Code'].isin(Region_Array)==Fal
#mort_countries
In [65]: mor_count=mort_countries.groupby(['Country Name'])

#mor_count
In [66]: mergedapm = world.set_index('iso_a3').join(mort_countries.set_index('Country Code'))
mergedapm = mergedapm.reset_index()
mergedapm = mergedapm.fillna(0)
#mergedapm
In [67]: fig, ax = plt.subplots(1, figsize=(40, 20))

ax.axis('off')
ax.set_title('Heat Map of Indicator Values Globally', fontdict={'fontsize': '40', 'fon
color = 'Reds'
vmin, vmax = 0, 231
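# Named "mappable" so it does not shadow the statsmodels alias sm imported in In [1]
mappable = plt.cm.ScalarMappable(cmap=color, norm=plt.Normalize(vmin=vmin, vmax=vmax))
mappable.set_array([])
cbar = fig.colorbar(mappable, ax=ax)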
cbar.ax.tick_params(labelsize=20)
mergedapm.plot('Indicator Value', cmap=color, linewidth=0.8, ax=ax, edgecolor='0.8',fi
Out[67]: <matplotlib.axes._subplots.AxesSubplot at 0x7f7b89a94550>
10.3 Extend the analysis above to find some countries of interest. These are defined as
• The countries that have a high mortality rate due to household and ambient air pollution,
but with low PM2.5 exposure
• The countries that have a low mortality rate due to household and ambient air pollution, but
with high PM2.5 exposure
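The exercise does not say what counts as "high" or "low", so any cutoff is a modelling choice. As a complement to comparing maps by eye, the sketch below flags both groups using the median of each indicator as the cutoff; the median rule, the frame `toy`, and its values are assumptions made for illustration.

```python
import pandas as pd

# Made-up table with one row per country (not WDI values).
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "PM2.5": [10.0, 60.0, 12.0, 55.0],
    "Mortality": [150.0, 20.0, 15.0, 180.0],
})

# Assumed cutoffs: above the median is "high", below it is "low".
pm_cut = toy["PM2.5"].median()
mort_cut = toy["Mortality"].median()

high_mort_low_pm = toy[(toy["Mortality"] > mort_cut) & (toy["PM2.5"] < pm_cut)]
low_mort_high_pm = toy[(toy["Mortality"] < mort_cut) & (toy["PM2.5"] > pm_cut)]

print(high_mort_low_pm["Country Code"].tolist())
print(low_mort_high_pm["Country Code"].tolist())
```

On the real data, both indicators would first need to be restricted to the same year and merged on Country Code before applying the cutoffs.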
19 Approach #1 - Compare Heat Maps

Answer.
In [68]: ePMc.rename(columns={"Indicator Value": "Indicator Value-PM2.5"})
Out[68]: Indicator Code \

341663 PM2.5
341664 PM2.5
341665 PM2.5
341666 PM2.5
341667 PM2.5
341668 PM2.5
341669 PM2.5
341670 PM2.5
341671 PM2.5
341673 PM2.5
341674 PM2.5
341675 PM2.5
341676 PM2.5
341677 PM2.5
341678 PM2.5
341679 PM2.5
341680 PM2.5
341681 PM2.5
341682 PM2.5
341683 PM2.5
341684 PM2.5
341685 PM2.5
341686 PM2.5
341687 PM2.5
341688 PM2.5
341689 PM2.5
341691 PM2.5
341692 PM2.5
341693 PM2.5
341694 PM2.5
... ...
641224 PM2.5
641225 PM2.5
641226 PM2.5
641227 PM2.5
641228 PM2.5
641229 PM2.5
641230 PM2.5
641231 PM2.5
641232 PM2.5
641233 PM2.5
641234 PM2.5
641235 PM2.5
641236 PM2.5
641237 PM2.5
641238 PM2.5
641241 PM2.5
641242 PM2.5
641243 PM2.5
641244 PM2.5
641245 PM2.5
641246 PM2.5
641247 PM2.5
641248 PM2.5
641249 PM2.5
641250 PM2.5
641251 PM2.5
641252 PM2.5
641253 PM2.5
641254 PM2.5
641255 PM2.5
Indicator Name_x \
341663 PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)
56
... ...
Country Name Country Code Year Indicator Value-PM2.5

341663 Afghanistan AFG 1990 65.486792
341664 Albania ALB 1990 22.718663
341665 Algeria DZA 1990 36.930258
341666 American Samoa ASM 1990 12.479083
341667 Andorra AND 1990 14.118742
341668 Angola AGO 1990 36.448116
341669 Antigua and Barbuda ATG 1990 22.408564
341670 Argentina ARG 1990 16.220530
341671 Armenia ARM 1990 36.860721
341673 Australia AUS 1990 10.411855
341674 Austria AUT 1990 16.866688
341675 Azerbaijan AZE 1990 23.942915
341676 Bahamas, The BHS 1990 21.019924
341677 Bahrain BHR 1990 68.158762
341678 Bangladesh BGD 1990 61.595630
341679 Barbados BRB 1990 28.212522
341680 Belarus BLR 1990 24.585848
341681 Belgium BEL 1990 17.587496
341682 Belize BLZ 1990 27.429802
341683 Benin BEN 1990 40.220136
341684 Bermuda BMU 1990 11.749759
341685 Bhutan BTN 1990 40.205969
341686 Bolivia BOL 1990 23.952679
341687 Bosnia and Herzegovina BIH 1990 30.352558
341688 Botswana BWA 1990 25.435431
341689 Brazil BRA 1990 15.143861
341691 Brunei Darussalam BRN 1990 6.732298
341692 Bulgaria BGR 1990 27.559881
341693 Burkina Faso BFA 1990 46.014661
341694 Burundi BDI 1990 41.248888
... ... ... ... ...
641224 Sudan SDN 2017 55.370834
641225 Suriname SUR 2017 24.780011
641226 Sweden SWE 2017 6.184665
641227 Switzerland CHE 2017 10.303100
641228 Syrian Arab Republic SYR 2017 43.757259
641229 Tajikistan TJK 2017 46.152185
641230 Tanzania TZA 2017 29.076641
641231 Thailand THA 2017 26.256727
641232 Timor-Leste TLS 2017 19.257209
641233 Togo TGO 2017 35.731336
641234 Tonga TON 2017 10.785479
641235 Trinidad and Tobago TTO 2017 24.108568
641236 Tunisia TUN 2017 37.655994
641237 Turkey TUR 2017 44.311526
641238 Turkmenistan TKM 2017 21.767721
641241 Uganda UGA 2017 50.494321
641242 Ukraine UKR 2017 20.309776
641243 United Arab Emirates ARE 2017 40.917510
641244 United Kingdom GBR 2017 10.472690
641245 United States USA 2017 7.409442
641246 Uruguay URY 2017 9.274883
641247 Uzbekistan UZB 2017 28.455901
641248 Vanuatu VUT 2017 11.652777
641249 Venezuela, RB VEN 2017 17.008554
641250 Vietnam VNM 2017 29.626728
641251 Virgin Islands (U.S.) VIR 2017 10.265312
641252 West Bank and Gaza PSE 2017 33.225630
641253 Yemen, Rep. YEM 2017 50.456007
641254 Zambia ZMB 2017 27.438035
641255 Zimbabwe ZWE 2017 22.251671
In [69]: mergedpm2 = pd.merge(mergedapm, ePMc, left_on='index', right_on='Country Code', how='o
mergedpm2 = mergedpm2.reset_index()
mergedpm2 = mergedpm2.fillna(0)
mergedpm2 = mergedpm2.drop(columns=['Country Name_x','Country Name_y','level_0','Count
mergedpm2 = mergedpm2.rename(columns={'index':'Country Code', 'Year_x': 'APM Year', 'Y
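#mergedpm2.sample(2)
In [70]: fig, ax = plt.subplots(1, figsize=(40, 20))  # line lost in extraction; size assumed to match In [67]

ax.axis('off')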
ax.set_title('Heat Map of APM Indicator Values Globally', fontdict={'fontsize': '40',
ax.set_aspect('equal')
color = 'Reds'
vmin, vmax = 0, 231
sm = plt.cm.ScalarMappable(cmap=(color), norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm._A = []
mergedapm.plot('Indicator Value', cmap='Reds', linewidth=0.8, ax=ax, edgecolor='0.8',
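plt.show()
In [71]: fig, ax = plt.subplots(1, figsize=(40, 20))  # line lost in extraction; size assumed to match In [67]

ax.axis('off')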
ax.set_title('Heat Map of PM2.5 Indicator Values Globally', fontdict={'fontsize': '40'
ax.set_aspect('equal')
color = 'Blues'
vmin, vmax = 0, 231
sm = plt.cm.ScalarMappable(cmap=(color), norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm._A = []
mergedpm2.plot('PM2.5 Ind. Value', cmap='Blues', edgecolor='k', linewidth = 1, ax=ax)
plt.show()
20 Approach #2 - Both graphs on one plane (quantiles and mean).

In [72]: from splot.mapping import vba_choropleth
In [73]: fig, ax = plt.subplots(1,2, figsize=(40, 20))

#ax.axis('off')
#ax.set_title('Heat Map of Indicator Values Globally', fontdict={'fontsize': '40', 'fo
#color = 'Reds'
#vmin, vmax = 0, 231
#sm = plt.cm.ScalarMappable(cmap=(color), norm=plt.Normalize(vmin=vmin, vmax=vmax))
#sm._A = []
#cbar = fig.colorbar(sm)
#cbar.ax.tick_params(labelsize=20)
#mergedpm2.plot(column='Country Code', scheme='quantiles', cmap='RdBu', ax=ax[0])
x=mergedpm2['APM Ind. Value'].values
y=mergedpm2['PM2.5 Ind. Value'].values
vba_choropleth(x,y,mergedpm2,rgb_mapclassify=dict(classifier='quantiles'), alpha_mapcl
vba_choropleth(x,y,mergedpm2,rgb_mapclassify=dict(classifier='std_mean'), alpha_mapcla
plt.show()
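`vba_choropleth` uses the first variable for colour and the second for transparency, each binned by the chosen classifier. As a rough illustration of what a quantile classifier does to each variable, the sketch below bins two made-up columns into equal-count classes with pandas and cross-tabulates them. This is not splot's or mapclassify's implementation, and the values and the choice of four classes are assumptions:

```python
import pandas as pd

# Made-up values standing in for the two indicator columns.
df = pd.DataFrame({
    "apm":  [20.0, 45.0, 60.0, 90.0, 150.0, 210.0, 30.0, 120.0],
    "pm25": [8.0, 35.0, 12.0, 50.0, 18.0, 60.0, 25.0, 40.0],
})

k = 4  # number of quantile classes (an assumption for this sketch)

# pd.qcut puts roughly equal numbers of rows in each bin; labels=False
# returns the bin index 0..k-1 instead of an interval.
df["apm_class"] = pd.qcut(df["apm"], q=k, labels=False)
df["pm25_class"] = pd.qcut(df["pm25"], q=k, labels=False)

# Rows far from the diagonal are candidates for "countries of interest":
# a high mortality class with a low exposure class, or the reverse.
print(pd.crosstab(df["apm_class"], df["pm25_class"]))
```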
21 Approach #3 - Establish middles and generate lists based on these
parameters.
In [74]: PM2med=mergedpm2['PM2.5 Ind. Value'].median()
APMmed=mergedapm['Indicator Value'].median()
PM2med, APMmed
Out[74]: (23.756718672876701, 63.899999999999999)
In [75]: pmlowapmhigh=mergedpm2[mergedpm2['PM2.5 Ind. Value'] <= PM2med][mergedpm2['APM Ind. Va
pmhighapmlow=mergedpm2[mergedpm2['PM2.5 Ind. Value'] > PM2med][mergedpm2['APM Ind. Val
#pmlowapmhigh.head()
pmlowapmhighlist=[pmlowapmhigh['Country Code'].unique()]
pmlowapmhighlist
Out[75]: [array(['ALB', 'BLZ', 'BWA', 'CIV', 'FJI', 'GEO', 'GIN', 'GUY', 'HTI',
'IDN', 'KGZ', 'LBR', 'LKA', 'MDA', 'MDG', 'MNE', 'MOZ', 'MWI',
'PHL', 'PNG', 'SLB', 'SLE', 'SWZ', 'TKM', 'TLS', 'UKR', 'VUT', 'ZWE'], dtype=o
In [76]: pmhighapmlowlist=[pmhighapmlow['Country Code'].unique()]
pmhighapmlowlist
Out[76]: [array(['ARE', 'ARM', 'AZE', 'BGR', 'BLR', 'BOL', 'CHL', 'CUB', 'CZE',
'DZA', 'HND', 'IRN', 'ISR', 'JOR', 'KOR', 'LBN', 'MAR', 'MEX',
'OMN', 'PER', 'POL', 'PSE', 'QAT', 'SLV', 'SRB', 'SUR', 'SVK',
'THA', 'TTO', 'TUN', 'TUR', 0], dtype=object)]
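The split above chains two boolean filters on the same frame. A minimal sketch of the same kind of median split written with one combined mask per group, assuming the second condition compares the mortality value with its median. The frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Made-up rows; column names follow the notebook's merged frame.
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "PM2.5 Ind. Value": [10.0, 40.0, 15.0, 50.0],
    "APM Ind. Value": [120.0, 30.0, 20.0, 150.0],
})

pm_med = toy["PM2.5 Ind. Value"].median()
apm_med = toy["APM Ind. Value"].median()

low_pm = toy["PM2.5 Ind. Value"] <= pm_med
high_apm = toy["APM Ind. Value"] > apm_med

# One mask per group; & combines the conditions element-wise.
pm_low_apm_high = toy[low_pm & high_apm]
pm_high_apm_low = toy[~low_pm & ~high_apm]

print(pm_low_apm_high["Country Code"].tolist())
print(pm_high_apm_low["Country Code"].tolist())
```

A single mask avoids the index realignment that pandas performs when a filtered frame is indexed with a mask built from the unfiltered one.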
22 Final answer: List of countries of interest.

In [77]: countriesInterest=pd.merge(pmlowapmhigh,pmhighapmlow,how='outer', on='Country Code')
countriesInterest['Country Code'].unique()
Out[77]: array(['ALB', 'BLZ', 'BWA', 'CIV', 'FJI', 'GEO', 'GIN', 'GUY', 'HTI',
'IDN', 'KGZ', 'LBR', 'LKA', 'MDA', 'MDG', 'MNE', 'MOZ', 'MWI',
'PHL', 'PNG', 'SLB', 'SLE', 'SWZ', 'TKM', 'TLS', 'UKR', 'VUT',
'ZWE', 'ARE', 'ARM', 'AZE', 'BGR', 'BLR', 'BOL', 'CHL', 'CUB',
'CZE', 'DZA', 'HND', 'IRN', 'ISR', 'JOR', 'KOR', 'LBN', 'MAR',
'MEX', 'OMN', 'PER', 'POL', 'PSE', 'QAT', 'SLV', 'SRB', 'SUR',
'SVK', 'THA', 'TTO', 'TUN', 'TUR', 0], dtype=object)
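The trailing `0` in the arrays above is not a country code; it most likely comes from the earlier `fillna(0)` calls, which also fill missing codes. A minimal sketch of one way to combine the two groups while labelling each row and dropping such placeholders. The frames and the `group` column are illustrative assumptions:

```python
import pandas as pd

# Made-up stand-ins for the two groups, including a placeholder 0 code.
group_a = pd.DataFrame({"Country Code": ["ALB", "BLZ", "ALB"]})
group_b = pd.DataFrame({"Country Code": ["ARE", "ARM", 0]})

combined = pd.concat(
    [group_a.assign(group="low PM2.5, high mortality"),
     group_b.assign(group="high PM2.5, low mortality")],
    ignore_index=True,
)

# Keep only real ISO codes (strings), then one row per country.
is_code = combined["Country Code"].map(lambda c: isinstance(c, str))
countries = combined[is_code].drop_duplicates("Country Code")

print(countries)
```

Keeping the `group` label makes it possible to tell afterwards which of the two definitions each country satisfied.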
10.4 Finally, we want to look at the mortality data by income. We expect higher income countries
to have lower pollution-related mortality. Find out if this assumption holds. Calculate summary
statistics and histograms for each income category and note any trends.
Answer.
In [78]: inmortplah=pd.merge(pmlowapmhigh, WDI_Country[['Country Code','Income Group']], how =

inmortphal=pd.merge(pmhighapmlow, WDI_Country[['Country Code','Income Group']], how =
inmortplah=inmortplah.drop(columns=['pop_est','continent','name','gdp_md_est'])
inmortphal=inmortphal.drop(columns=['pop_est','continent','name','gdp_md_est'])
In [79]: HI=['High income']

LI=['Low income']
LMI=['Lower middle income']
UMI=['Upper middle income']
incomecats=[HI, LI, LMI, UMI]
PM2.5 Low w/ APM High by income
In [80]: #High Income

inmortplahHI=inmortplah[inmortplah['Income Group'].isin(HI)].copy()
#Low Income
inmortplahLI=inmortplah[inmortplah['Income Group'].isin(LI)].copy()
#lower Middle Income
inmortplahLMI=inmortplah[inmortplah['Income Group'].isin(LMI)].copy()
#Upper Middle Income
inmortplahUMI=inmortplah[inmortplah['Income Group'].isin(UMI)].copy()
PM2.5 High w/ APM Low by income
In [81]: #High Income

inmortphalHI=inmortphal[inmortphal['Income Group'].isin(HI)].copy()
#Low Income
inmortphalLI=inmortphal[inmortphal['Income Group'].isin(LI)].copy()
#lower Middle Income
inmortphalLMI=inmortphal[inmortphal['Income Group'].isin(LMI)].copy()
#Upper Middle Income
inmortphalUMI=inmortphal[inmortphal['Income Group'].isin(UMI)].copy()
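The eight per-income subsets above can also be summarized in one pass. A minimal sketch using `groupby` on an ordered income category; the frame `toy` and its values are made up, and in the notebook the input would be a frame such as `inmortplah`:

```python
import pandas as pd

# Made-up rows; in the notebook the frame would be e.g. `inmortplah`.
toy = pd.DataFrame({
    "Income Group": ["Low income", "Low income", "Lower middle income",
                     "Lower middle income", "Upper middle income"],
    "APM Ind. Value": [200.0, 180.0, 140.0, 120.0, 80.0],
})

# Order the categories so the summary reads from poorest to richest.
order = ["Low income", "Lower middle income",
         "Upper middle income", "High income"]
toy["Income Group"] = pd.Categorical(toy["Income Group"],
                                     categories=order, ordered=True)

# observed=False asks pandas to keep categories that have no rows.
summary = (toy.groupby("Income Group", observed=False)["APM Ind. Value"]
              .agg(["count", "mean"]))
print(summary)
```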
23 Low Income
In [82]: inmortplahLI.describe()
Out[82]: APM Ind. Value PM2.5 Ind. Value

count 67.000000 67.000000
mean 192.665672 20.040694
std 71.315888 2.806193
min 110.000000 14.942271
25% 159.600000 17.589239
50% 170.200000 21.196673
75% 243.300000 22.625162
max 324.100000 23.565921
In [83]: LIgraphBest=inmortplahLI['APM Ind. Value'].hist(color='Teal')
In [84]: LIgraphPerspective=inmortplahLI['APM Ind. Value'].hist(color='Teal').axis([0,300,0,100
In [85]: inmortphalLI.describe()
Out[85]: APM Ind. Value PM2.5 Ind. Value

count 0.0 0.0
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
24 Lower Middle Income

In [86]: inmortplahLMI.describe()
Out[86]: APM Ind. Value PM2.5 Ind. Value

count 124.000000 124.000000
mean 136.816129 18.388129
std 42.006760 3.778774
min 70.700000 11.590909
25% 112.400000 14.629720
50% 137.000000 18.306346
75% 139.800000 22.006672
max 269.100000 23.756719
In [87]: LMI1graphBest=inmortplahLMI['APM Ind. Value'].hist(color='Teal')
In [88]: LMI1graphPerspective=inmortplahLMI['APM Ind. Value'].hist(color='Teal').axis([0,300,0,
In [89]: inmortphalLMI.describe()
Out[89]: APM Ind. Value PM2.5 Ind. Value

count 65.000000 65.000000
mean 43.447692 30.884284
std 22.029122 3.516157
min 0.000000 23.952679
25% 41.900000 27.808026
50% 49.100000 30.967813
75% 60.700000 33.492964
max 63.700000 38.450491
In [90]: LMI2graphBest=inmortphalLMI['APM Ind. Value'].hist(color='Grey')
In [91]: LMI2graphPerspective=inmortphalLMI['APM Ind. Value'].hist(color='Grey').axis([0,300,0,
25 Upper Middle Income
In [92]: inmortplahUMI.describe()
Out[92]: APM Ind. Value PM2.5 Ind. Value

count 46.000000 46.000000
mean 85.254348 18.970857
std 14.692322 4.538723
min 68.000000 10.840160
25% 68.150000 13.609475
50% 79.300000 21.426543
75% 99.000000 22.513662
max 107.800000 23.721890
In [93]: UMI1graphBest=inmortplahUMI['APM Ind. Value'].hist(color='Teal')
In [94]: UMI1graphPerspective=inmortplahUMI['APM Ind. Value'].hist(color='Teal').axis([0,300,0,
In [95]: inmortphalUMI.describe()
Out[95]: APM Ind. Value PM2.5 Ind. Value

count 139.000000 139.000000
mean 54.072662 32.128471
std 6.907702 5.421370
min 36.700000 23.792352
25% 49.700000 28.131117
50% 51.400000 31.174724
75% 61.500000 36.272874
max 63.900000 45.401556
In [96]: UMI2graphBest=inmortphalUMI['APM Ind. Value'].hist(color='Grey')
In [97]: UMI2graphPerspective=inmortphalUMI['APM Ind. Value'].hist(color='Grey').axis([0,300,0,
26 High Income
In [98]: inmortplahHI.describe()
Out[98]: APM Ind. Value PM2.5 Ind. Value

count 0.0 0.0
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
In [99]: inmortphalHI.describe()
Out[99]: APM Ind. Value PM2.5 Ind. Value

count 82.000000 82.000000
mean 38.890244 39.473241
std 13.603763 20.863216
min 15.400000 23.904687
25% 25.300000 25.957023
50% 38.600000 29.723332
75% 53.900000 40.104557
max 54.700000 94.403891
In [100]: HIgraphBest=inmortphalHI['APM Ind. Value'].hist(color='Grey')
In [101]: HIgraphPerspective=inmortphalHI['APM Ind. Value'].hist(color='Grey').axis([0,300,0,10
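Answer.
In general, the assumption holds: higher-income countries have lower mortality rates due to
household and ambient air pollution. Among the low-PM2.5, high-mortality countries, the mean rate
falls from 192.7 (low income) to 136.8 (lower middle income) to 85.3 (upper middle income), and no
high-income country falls in this group. I will note that the largest change is the drop of about
56 between low income and lower middle income. Among the high-PM2.5, low-mortality countries the
pattern is weaker: the means are 43.4 (lower middle income), 54.1 (upper middle income), and 38.9
(high income), and no low-income country falls in this group.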
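To make the comparison across income groups explicit, the means of `APM Ind. Value` reported by the `describe()` calls above can be placed side by side. The sketch below copies those means by hand; the column labels are descriptive names chosen here, not names from the notebook:

```python
import pandas as pd

# Means of 'APM Ind. Value' copied from the describe() outputs above;
# None marks a group with no rows.
means = pd.DataFrame(
    {"low PM2.5, high mortality": [192.665672, 136.816129, 85.254348, None],
     "high PM2.5, low mortality": [None, 43.447692, 54.072662, 38.890244]},
    index=["Low income", "Lower middle income",
           "Upper middle income", "High income"],
)

# Change in the mean when moving up one income category.
steps = means.diff()
print(means.round(1))
print(steps.round(1))
```

Note that the counts in those outputs are country-year rows within the two groups of interest, so these means describe the selected subsets rather than all countries.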
10.5 At the start, we asked some questions. Based on your analysis, provide a short answer to
each of these:
1. Are we making any progress in reducing the amount of emitted pollutants across the globe?
2. Which are the critical regions where we should start environmental campaigns?
3. Are we making any progress in the prevention of deaths related to air pollution?
4. Which demographic characteristics seem to correlate with the number of health-related
issues derived from air pollution?
Answer.
1. Very little progress is being made.

2. Europe and Central Asia, East Asia and Pacific, North America
3. Not really.
4. Nationality (region) and income. In a separate project, one could also investigate disparities
based on gender using the series codes "SH.STA.AIRP.FE.P5" for the female population and
"SH.STA.AIRP.MA.P5" for the male population.
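As a starting point for that follow-up, the two series could be filtered out of a long-format table and compared per country. A minimal sketch, assuming data shaped like the notebook's melted WDI frames; the frame `wdi_long`, its values, and the placeholder code `OTHER.CODE` are made up:

```python
import pandas as pd

# Made-up long-format rows; "OTHER.CODE" is a placeholder for any other series.
wdi_long = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Indicator Code": ["SH.STA.AIRP.FE.P5", "SH.STA.AIRP.MA.P5",
                       "OTHER.CODE", "SH.STA.AIRP.FE.P5",
                       "SH.STA.AIRP.MA.P5"],
    "Indicator Value": [50.0, 70.0, 60.0, 90.0, 120.0],
})

codes = {"SH.STA.AIRP.FE.P5": "female", "SH.STA.AIRP.MA.P5": "male"}

# Keep the two series, label them, and put female/male side by side.
by_sex = (wdi_long[wdi_long["Indicator Code"].isin(list(codes))]
          .assign(sex=lambda d: d["Indicator Code"].map(codes))
          .pivot(index="Country Code", columns="sex", values="Indicator Value"))

by_sex["male_minus_female"] = by_sex["male"] - by_sex["female"]
print(by_sex)
```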