Air Quality
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
import scipy.stats as stats
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures
import warnings
[6]: 0 1 2 3 4 \
Date 10/03/2004 10/03/2004 10/03/2004 10/03/2004 10/03/2004
Time 18.00.00 19.00.00 20.00.00 21.00.00 22.00.00
CO(GT) 2,6 2 2,2 2,2 1,6
PT08.S1(CO) 1360.0 1292.0 1402.0 1376.0 1272.0
NMHC(GT) 150.0 112.0 88.0 80.0 51.0
C6H6(GT) 11,9 9,4 9,0 9,2 6,5
PT08.S2(NMHC) 1046.0 955.0 939.0 948.0 836.0
NOx(GT) 166.0 103.0 131.0 172.0 131.0
PT08.S3(NOx) 1056.0 1174.0 1140.0 1092.0 1205.0
NO2(GT) 113.0 92.0 114.0 122.0 116.0
PT08.S4(NO2) 1692.0 1559.0 1555.0 1584.0 1490.0
PT08.S5(O3) 1268.0 972.0 1074.0 1203.0 1110.0
T 13,6 13,3 11,9 11,0 11,2
RH 48,9 47,7 54,0 60,0 59,6
AH 0,7578 0,7255 0,7502 0,7867 0,7888
5 6 7 8 9
Date 10/03/2004 11/03/2004 11/03/2004 11/03/2004 11/03/2004
Time 23.00.00 00.00.00 01.00.00 02.00.00 03.00.00
CO(GT) 1,2 1,2 1 0,9 0,6
PT08.S1(CO) 1197.0 1185.0 1136.0 1094.0 1010.0
NMHC(GT) 38.0 31.0 31.0 24.0 19.0
C6H6(GT) 4,7 3,6 3,3 2,3 1,7
PT08.S2(NMHC) 750.0 690.0 672.0 609.0 561.0
NOx(GT) 89.0 62.0 62.0 45.0 -200.0
PT08.S3(NOx) 1337.0 1462.0 1453.0 1579.0 1705.0
NO2(GT) 96.0 77.0 76.0 60.0 -200.0
PT08.S4(NO2) 1393.0 1333.0 1333.0 1276.0 1235.0
PT08.S5(O3) 949.0 733.0 730.0 620.0 501.0
T 11,2 11,3 10,7 10,7 10,3
RH 59,2 56,8 60,0 59,7 60,2
AH 0,7848 0,7603 0,7702 0,7648 0,7517
Let’s check the data types of our features and look for potential missing values.
[7]: df.dtypes
We should keep in mind that C6H6(GT) is the target feature in this analysis.
[8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9471 entries, 0 to 9470
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 9357 non-null object
1 Time 9357 non-null object
2 CO(GT) 9357 non-null object
3 PT08.S1(CO) 9357 non-null float64
4 NMHC(GT) 9357 non-null float64
5 C6H6(GT) 9357 non-null object
6 PT08.S2(NMHC) 9357 non-null float64
7 NOx(GT) 9357 non-null float64
8 PT08.S3(NOx) 9357 non-null float64
9 NO2(GT) 9357 non-null float64
10 PT08.S4(NO2) 9357 non-null float64
11 PT08.S5(O3) 9357 non-null float64
12 T 9357 non-null object
13 RH 9357 non-null object
14 AH 9357 non-null object
dtypes: float64(8), object(7)
memory usage: 1.1+ MB
[10]: df
9466 NaN NaN NaN NaN NaN
9467 NaN NaN NaN NaN NaN
9468 NaN NaN NaN NaN NaN
9469 NaN NaN NaN NaN NaN
9470 NaN NaN NaN NaN NaN
PT08.S5(O3) T RH AH
0 1268.0 13,6 48,9 0,7578
1 972.0 13,3 47,7 0,7255
2 1074.0 11,9 54,0 0,7502
3 1203.0 11,0 60,0 0,7867
4 1110.0 11,2 59,6 0,7888
… … … … …
9466 NaN NaN NaN NaN
9467 NaN NaN NaN NaN
9468 NaN NaN NaN NaN
9469 NaN NaN NaN NaN
9470 NaN NaN NaN NaN
Now we can convert the string-typed features to float. Before the conversion, the decimal separator “,” has to be replaced with “.”.
[11]: def str_to_float(features):
for feature in features:
df[feature] = df[feature].str.replace(",", ".")
df[feature] = df[feature].astype(np.float64)
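The conversion cell itself is not reproduced above; judging by the object columns in df.info() and the float dtypes shown later, it was presumably a call along these lines:
str_to_float(["CO(GT)", "C6H6(GT)", "T", "RH", "AH"])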
Looking at the previous table output, we can observe rows with all NaN values. Let’s identify
which indices in the dataframe have only NaN values and remove them.
[13]: df[df.isnull().all(axis=1)].index
[13]: Index([9357, 9358, 9359, 9360, 9361, 9362, 9363, 9364, 9365, 9366,
…
9461, 9462, 9463, 9464, 9465, 9466, 9467, 9468, 9469, 9470],
dtype='int64', length=114)
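The removal cell is not shown; a minimal equivalent, assuming nothing else happens in it, would be:
# drop the trailing rows in which every column is NaN
df = df.dropna(how="all")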
[15]: Date object
Time object
CO(GT) float64
PT08.S1(CO) float64
NMHC(GT) float64
C6H6(GT) float64
PT08.S2(NMHC) float64
NOx(GT) float64
PT08.S3(NOx) float64
NO2(GT) float64
PT08.S4(NO2) float64
PT08.S5(O3) float64
T float64
RH float64
AH float64
dtype: object
[16]: df.head(10).T
[16]: 0 1 2 3 4 \
Date 10/03/2004 10/03/2004 10/03/2004 10/03/2004 10/03/2004
Time 18.00.00 19.00.00 20.00.00 21.00.00 22.00.00
CO(GT) 2.6 2.0 2.2 2.2 1.6
PT08.S1(CO) 1360.0 1292.0 1402.0 1376.0 1272.0
NMHC(GT) 150.0 112.0 88.0 80.0 51.0
C6H6(GT) 11.9 9.4 9.0 9.2 6.5
PT08.S2(NMHC) 1046.0 955.0 939.0 948.0 836.0
NOx(GT) 166.0 103.0 131.0 172.0 131.0
PT08.S3(NOx) 1056.0 1174.0 1140.0 1092.0 1205.0
NO2(GT) 113.0 92.0 114.0 122.0 116.0
PT08.S4(NO2) 1692.0 1559.0 1555.0 1584.0 1490.0
PT08.S5(O3) 1268.0 972.0 1074.0 1203.0 1110.0
T 13.6 13.3 11.9 11.0 11.2
RH 48.9 47.7 54.0 60.0 59.6
AH 0.7578 0.7255 0.7502 0.7867 0.7888
5 6 7 8 9
Date 10/03/2004 11/03/2004 11/03/2004 11/03/2004 11/03/2004
Time 23.00.00 00.00.00 01.00.00 02.00.00 03.00.00
CO(GT) 1.2 1.2 1.0 0.9 0.6
PT08.S1(CO) 1197.0 1185.0 1136.0 1094.0 1010.0
NMHC(GT) 38.0 31.0 31.0 24.0 19.0
C6H6(GT) 4.7 3.6 3.3 2.3 1.7
PT08.S2(NMHC) 750.0 690.0 672.0 609.0 561.0
NOx(GT) 89.0 62.0 62.0 45.0 NaN
PT08.S3(NOx) 1337.0 1462.0 1453.0 1579.0 1705.0
NO2(GT) 96.0 77.0 76.0 60.0 NaN
PT08.S4(NO2) 1393.0 1333.0 1333.0 1276.0 1235.0
PT08.S5(O3) 949.0 733.0 730.0 620.0 501.0
T 11.2 11.3 10.7 10.7 10.3
RH 59.2 56.8 60.0 59.7 60.2
AH 0.7848 0.7603 0.7702 0.7648 0.7517
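In the output above, the -200.0 sentinel values present in the raw data (e.g. NOx(GT) and NO2(GT) in the last column) now appear as NaN, so somewhere in the cells not reproduced here the dataset’s missing-value marker -200 was replaced; the step presumably amounted to:
df = df.replace(-200, np.nan)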
Examine the number of missing values for each feature. We can also display the count of non-NaN
values for each feature.
[17]: df.isna().sum()
[17]: Date 0
Time 0
CO(GT) 1683
PT08.S1(CO) 366
NMHC(GT) 8443
C6H6(GT) 366
PT08.S2(NMHC) 366
NOx(GT) 1639
PT08.S3(NOx) 366
NO2(GT) 1642
PT08.S4(NO2) 366
PT08.S5(O3) 366
T 366
RH 366
AH 366
dtype: int64
[18]: msno.bar(df)
Since only slightly less than 10% of the values in the NMHC(GT) feature are non-NaN, we can remove it entirely.
[19]: drop_features(["NMHC(GT)"])
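The drop_features helper is defined in an earlier cell that is not reproduced here; it presumably just wraps DataFrame.drop on the working dataframe, roughly:
def drop_features(features):
    # drop the given columns from the working dataframe in place
    df.drop(columns=features, inplace=True)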
We can extract two new features, Hour and Month, which will be needed later. Hour comes from the Time column below; Month is derived from the Date column once it is parsed.
[20]: df["Hour"] = df["Time"].apply(lambda x: int(x.split(".")[0]))
drop_features(["Time"])
Let’s remove rows containing missing values in the target feature. We cannot fill values there, as
it would lead to data leakage.
[22]: df = df[df["C6H6(GT)"].notna()]
We can fill in the remaining features using a method that propagates the last valid observation forward into the following gaps. It is therefore essential to make sure the data is sorted by date before applying this technique. This approach works well here because it does not rely on overall averages or medians for the entire period; instead, it uses the nearest preceding measurement, typically an hour or a few hours earlier, assuming that air quality does not change significantly over such short gaps.
[23]: df = df.sort_values(by=["Date", "Hour"])
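Around this point the notebook also parses the Date column, uses it as the index, extracts Month, and forward-fills the remaining gaps. Those cells are not reproduced here, but judging by the later use of df.index.day, df.index.year and the Month column, they resembled the following sketch (not necessarily the author’s exact code):
# parse the day-first dates, use them as the index, and extract the month
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.set_index("Date")
df["Month"] = df.index.month
# propagate the last valid observation forward into the remaining gaps
df = df.ffill()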
Let’s observe the number of air quality measurements for each day and year.
[25]: df.index.day.value_counts()
[25]: Date
23 312
12 312
13 312
22 312
18 312
2 309
21 308
1 308
24 307
25 307
14 305
28 302
19 302
17 292
11 290
3 288
15 288
30 288
20 288
27 288
16 288
6 288
7 287
5 287
29 286
26 285
4 279
10 270
8 263
9 237
31 191
Name: count, dtype: int64
[26]: df.index.year.value_counts()
[26]: Date
2004 6882
2005 2109
Name: count, dtype: int64
Given that the measurement period spans March 2004 to March 2005 inclusive, we can safely remove the ‘Date’ column, since we have already extracted ‘Hour’ and ‘Month’. It is reasonable to assume that the year and the day of the month will no longer be needed. From personal observation, air quality does not differ much between particular weekdays in most cities, so we will not distinguish between Monday, Tuesday, etc.
Because the year is removed, some days in March appear twice, but this won’t significantly affect the regression models.
[27]: df.reset_index(inplace=True)
drop_features(["Date"])
df.T
[27]: 0 1 2 3 4 \
CO(GT) 2.6000 2.0000 2.2000 2.2000 1.6000
PT08.S1(CO) 1360.0000 1292.0000 1402.0000 1376.0000 1272.0000
C6H6(GT) 11.9000 9.4000 9.0000 9.2000 6.5000
PT08.S2(NMHC) 1046.0000 955.0000 939.0000 948.0000 836.0000
NOx(GT) 166.0000 103.0000 131.0000 172.0000 131.0000
PT08.S3(NOx) 1056.0000 1174.0000 1140.0000 1092.0000 1205.0000
NO2(GT) 113.0000 92.0000 114.0000 122.0000 116.0000
PT08.S4(NO2) 1692.0000 1559.0000 1555.0000 1584.0000 1490.0000
PT08.S5(O3) 1268.0000 972.0000 1074.0000 1203.0000 1110.0000
T 13.6000 13.3000 11.9000 11.0000 11.2000
RH 48.9000 47.7000 54.0000 60.0000 59.6000
AH 0.7578 0.7255 0.7502 0.7867 0.7888
Hour 18.0000 19.0000 20.0000 21.0000 22.0000
Month 3.0000 3.0000 3.0000 3.0000 3.0000
5 6 7 8 9 … \
CO(GT) 1.2000 1.2000 1.0000 0.9000 0.6000 …
PT08.S1(CO) 1197.0000 1185.0000 1136.0000 1094.0000 1010.0000 …
C6H6(GT) 4.7000 3.6000 3.3000 2.3000 1.7000 …
PT08.S2(NMHC) 750.0000 690.0000 672.0000 609.0000 561.0000 …
NOx(GT) 89.0000 62.0000 62.0000 45.0000 45.0000 …
PT08.S3(NOx) 1337.0000 1462.0000 1453.0000 1579.0000 1705.0000 …
NO2(GT) 96.0000 77.0000 76.0000 60.0000 60.0000 …
PT08.S4(NO2) 1393.0000 1333.0000 1333.0000 1276.0000 1235.0000 …
PT08.S5(O3) 949.0000 733.0000 730.0000 620.0000 501.0000 …
T 11.2000 11.3000 10.7000 10.7000 10.3000 …
RH 59.2000 56.8000 60.0000 59.7000 60.2000 …
AH 0.7848 0.7603 0.7702 0.7648 0.7517 …
Hour 23.0000 0.0000 1.0000 2.0000 3.0000 …
Month 3.0000 3.0000 3.0000 3.0000 3.0000 …
PT08.S2(NMHC) 1101.0000 1027.0000 1063.0000 961.0000 1047.0000
NOx(GT) 472.0000 353.0000 293.0000 235.0000 265.0000
PT08.S3(NOx) 539.0000 604.0000 603.0000 702.0000 654.0000
NO2(GT) 190.0000 179.0000 175.0000 156.0000 168.0000
PT08.S4(NO2) 1374.0000 1264.0000 1241.0000 1041.0000 1129.0000
PT08.S5(O3) 1729.0000 1269.0000 1092.0000 770.0000 816.0000
T 21.9000 24.3000 26.9000 28.3000 28.5000
RH 29.3000 23.7000 18.3000 13.5000 13.1000
AH 0.7568 0.7119 0.6406 0.5139 0.5028
Hour 10.0000 11.0000 12.0000 13.0000 14.0000
Month 4.0000 4.0000 4.0000 4.0000 4.0000
[28]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8991 entries, 0 to 8990
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CO(GT) 8991 non-null float64
1 PT08.S1(CO) 8991 non-null float64
2 C6H6(GT) 8991 non-null float64
3 PT08.S2(NMHC) 8991 non-null float64
4 NOx(GT) 8991 non-null float64
5 PT08.S3(NOx) 8991 non-null float64
6 NO2(GT) 8991 non-null float64
7 PT08.S4(NO2) 8991 non-null float64
8 PT08.S5(O3) 8991 non-null float64
9 T 8991 non-null float64
10 RH 8991 non-null float64
11 AH 8991 non-null float64
12 Hour 8991 non-null int64
13 Month 8991 non-null int32
dtypes: float64(12), int32(1), int64(1)
memory usage: 948.4 KB
1.1.2 Duplicates
Let’s check for duplicates in our dataset:
[29]: df.duplicated().sum()
[29]: 0
1.1.3 Feature Division
[30]: target = "C6H6(GT)"
target_feature = [target]
categorical_features = ["Month", "Hour"]
numerical_features = [col for col in df.columns if col not in␣
↪(categorical_features + target_feature)]
[31]: df[target_feature]
[31]: C6H6(GT)
0 11.9
1 9.4
2 9.0
3 9.2
4 6.5
… …
8986 13.5
8987 11.4
8988 12.4
8989 9.5
8990 11.9
[32]: df[categorical_features]
[33]: df[numerical_features]
CO(GT) PT08.S1(CO) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) \
3 2.2 1376.0 948.0 172.0 1092.0 122.0
4 1.6 1272.0 836.0 131.0 1205.0 116.0
… … … … … … …
8986 3.1 1314.0 1101.0 472.0 539.0 190.0
8987 2.4 1163.0 1027.0 353.0 604.0 179.0
8988 2.4 1142.0 1063.0 293.0 603.0 175.0
8989 2.1 1003.0 961.0 235.0 702.0 156.0
8990 2.2 1071.0 1047.0 265.0 654.0 168.0
PT08.S4(NO2) PT08.S5(O3) T RH AH
0 1692.0 1268.0 13.6 48.9 0.7578
1 1559.0 972.0 13.3 47.7 0.7255
2 1555.0 1074.0 11.9 54.0 0.7502
3 1584.0 1203.0 11.0 60.0 0.7867
4 1490.0 1110.0 11.2 59.6 0.7888
… … … … … …
8986 1374.0 1729.0 21.9 29.3 0.7568
8987 1264.0 1269.0 24.3 23.7 0.7119
8988 1241.0 1092.0 26.9 18.3 0.6406
8989 1041.0 770.0 28.3 13.5 0.5139
8990 1129.0 816.0 28.5 13.1 0.5028
extended_describe["lower_limit"] = extended_describe["25%"] - 1.
↪5*extended_describe["IQR"]
extended_describe["upper_limit"] = extended_describe["75%"] + 1.
↪5*extended_describe["IQR"]
return extended_describe
count mean std min 25% \
C6H6(GT) 8991.0 10.083105 7.449820 0.1000 4.4000
PT08.S2(NMHC) 8991.0 939.153376 266.831429 383.0000 734.5000
NOx(GT) 8991.0 236.863864 201.004000 2.0000 96.0000
PT08.S3(NOx) 8991.0 835.493605 256.817320 322.0000 658.0000
NO2(GT) 8991.0 108.525637 46.565848 2.0000 73.0000
PT08.S4(NO2) 8991.0 1456.264598 346.206794 551.0000 1227.0000
PT08.S5(O3) 8991.0 1022.906128 398.484288 221.0000 731.5000
T 8991.0 18.317829 8.832116 -1.9000 11.8000
RH 8991.0 49.234201 17.316892 9.2000 35.8000
AH 8991.0 1.025530 0.403813 0.1847 0.7368
Hour 8991.0 11.479591 6.913320 0.0000 5.0000
Month 8991.0 6.327772 3.407854 1.0000 3.0000
upper_limit
CO(GT) 5.25000
PT08.S1(CO) 1672.00000
C6H6(GT) 28.40000
PT08.S2(NMHC) 1688.25000
NOx(GT) 641.00000
PT08.S3(NOx) 1436.75000
NO2(GT) 230.50000
PT08.S4(NO2) 2344.50000
PT08.S5(O3) 2086.50000
T 43.30000
RH 102.55000
AH 2.17905
Hour 35.00000
Month 18.00000
Let’s also examine the number of unique values for a particular feature:
[36]: df.nunique()
[36]: CO(GT) 94
PT08.S1(CO) 1041
C6H6(GT) 407
PT08.S2(NMHC) 1245
NOx(GT) 898
PT08.S3(NOx) 1221
NO2(GT) 274
PT08.S4(NO2) 1603
PT08.S5(O3) 1743
T 436
RH 753
AH 6683
Hour 24
Month 12
dtype: int64
def descriptive_plots(df, include_target=True):
    numerical = numerical_features + (target_feature if include_target else [])
    extended_describe = get_extended_description(df[numerical])
    fig, axes = plt.subplots(len(numerical), 1, figsize=(10, 3 * len(numerical)))
    for i, feature in enumerate(numerical):
        # histogram with KDE, with red lines marking the IQR-based outlier limits
        g = sns.histplot(data=df, x=feature, ax=axes[i], kde=True)
        axes[i].axvline(extended_describe.loc[feature, "upper_limit"], linewidth=2, color="red")
        axes[i].axvline(extended_describe.loc[feature, "lower_limit"], linewidth=2, color="red")
        g.set(ylabel=None)
    plt.tight_layout()
    plt.show()

def box_plots(df, x=None, hue=None):
    numerical = numerical_features + target_feature
    fig, axes = plt.subplots(len(numerical), 1, figsize=(10, 4 * len(numerical)))
    for i, feature in enumerate(numerical):
        sns.boxplot(data=df, x=x, y=feature, hue=hue, ax=axes[i])
        axes[i].tick_params(axis="x", rotation=90)
        axes[i].grid(axis="y")
    plt.tight_layout()
    plt.show()
/Users/kkozik/Library/anaconda3/envs/psi_env/lib/python3.10/site-
packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to
tight
self._figure.tight_layout(*args, **kwargs)
We can observe that the distribution of the target data is slightly positively skewed.
Categorical Features
[46]: cat_countplot(categorical_features[0])
[47]: cat_countplot(categorical_features[1])
We notice that more measurements were taken in March; this is because it is the only month that appears in two different years.
Numerical Features
[48]: descriptive_plots(df)
The distributions of all numerical features are close to Gaussian. Some of them show slight skewness, so they will be transformed later in the analysis.
[49]: violin_plots(df)
[50]: box_plots(df)
Similar conclusions can be drawn from the box plots and violin plots above. Some of the features need to be transformed before modeling, both to reduce skewness and to make techniques such as the standard scaler appropriate, since standardization works best on data whose distribution is approximately Gaussian.
In terms of average CO concentration, the densest part of the distribution lies around 2 mg/m^3 for all months. Additionally, a notable secondary density peak appears at higher CO concentrations, around 6 mg/m^3, predominantly in October.
We can observe that the peak density in the hourly averaged sensor responses for CO, NMHC,
NOx, and O3 is concentrated around specific values throughout all months. However, the sensor
for NO2 exhibits peaks at lower values during the months between November and February. During
spring and summer, the hourly averaged sensor responses for NO2 show higher values.
The majority of values for the true hourly averaged NO2 concentration are around 50 (µg/m^3)
only in August. The remaining months exhibit a distinctly different density distribution, generally
with higher values. It may be caused by seasonal variations influenced by factors such as weather
conditions or industrial activities. A potential reason for this phenomenon could be the high
temperature and low humidity, as evident in plots featuring temperature (T) and relative humidity
(RH) as key features.
For the majority of records, average concentrations of CO, NO2, NOx and C6H6 are lower between 1 am and 5 am. This can be attributed to reduced human activity during these early morning hours, which lowers emissions from sources such as vehicle exhaust and industrial facilities.
We can also observe a drop in relative humidity during the day (roughly from 12 pm to 5 pm).
• Violin plots numerical vs categorical
[53]: violin_plots(df, x=categorical_features[0])
[54]: violin_plots(df, x=categorical_features[1])
The same observations can be made from the violin plots, which additionally reveal details such as quartiles, medians, and the shape of each distribution.
• Box plots numerical vs categorical
[55]: box_plots(df, x=categorical_features[0])
[56]: box_plots(df, x=categorical_features[1])
Some outliers are noticeable in the box plots. It is important to emphasize that they appear to be genuine observations that follow the natural distribution of the data, so they should not be artificially removed; keeping them gives the model a more accurate reflection of reality.
• scatter plots
[57]: scatter_plots(df, hue=categorical_features[0])
[58]: scatter_plots(df, hue=categorical_features[1])
From the scatterplots above, we can see strong correlations between some features and the target
variable. Especially notable is the strong positive correlation between the target variable and
PT08.S2(NMHC). Also, worth mentioning are the correlations between the target variable and
CO(GT), PT08.S1(CO), PT08.S3(NOx), PT08.S5(O3).
These patterns are also clearly evident when analyzing the correlation matrix and pairplot, which
will be presented below:
• pairplots
[59]: sns.pairplot(df[numerical_features+target_feature])
/Users/kkozik/Library/anaconda3/envs/psi_env/lib/python3.10/site-
packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to
tight
self._figure.tight_layout(*args, **kwargs)
• Pearson Correlations
[60]: fig, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(df[numerical_features+target_feature].corr(), cmap="coolwarm",␣
↪annot=True, fmt=".2f", vmin=-1, vmax=1)
Let’s create a function to examine the statistical significance of correlations between pairs of features
in a given dataset. If the p-value is greater than 0.05, the function considers the correlation between
two features as statistically insignificant. This threshold (0.05) is a common significance level, but
it can be adjusted based on the desired level of confidence in hypothesis testing.
[61]: def check_correlation_importance(data, corr_type):
cnt = 0
feature_names = data.columns.values
for f1 in feature_names:
for f2 in feature_names:
if corr_type == "pearson":
statistic, pvalue = stats.pearsonr(data[f1], data[f2])
elif corr_type == "spearman":
statistic, pvalue = stats.spearmanr(data[f1], data[f2])
else:
return
if statistic < 0.7 and statistic > -0.7:
continue
if pvalue > 0.05:
cnt += 1
print(f"Statistically insignificant correlation between {f1}␣
↪and {f2} with correlation coefficient {statistic}", end="\n\n")
if cnt == 0:
print("All correlation pairs are statistically significant")
[62]: check_correlation_importance(df[numerical_features+target_feature],␣
↪corr_type="pearson")
[64]: check_correlation_importance(df[numerical_features+target_feature],␣
↪corr_type="spearman")
It’s worth mentioning that transformations can also be applied to the target feature. However,
it’s important to remember to apply the inverse transformation when evaluating the model results,
especially for metrics like R2, MAE, and MSE.
[65]: features_with_skewness = ["CO(GT)", "PT08.S1(CO)", "PT08.S2(NMHC)", "NOx(GT)",
"PT08.S3(NOx)", "NO2(GT)", "PT08.S5(O3)", "C6H6(GT)"]
df[features_with_skewness] = df[features_with_skewness].apply(lambda x: np.
↪log(x))
[66]: descriptive_plots(df)
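The cells between [66] and [71] are not reproduced above; before the split, the (log-transformed) target has to be separated from the feature matrix, roughly as follows (a sketch, not necessarily the author’s exact code):
y = df[target]                  # log-transformed C6H6(GT)
df = df.drop(columns=[target])  # keep only the predictors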
[71]: X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2,␣
↪random_state=100)
[72]: X_train
2.3 Standardization
For the numerical data prepared in this way, we can apply standardization. There is no need to standardize the target feature, since standardization is only a rescaling of the data; leaving the target as it is also means that only the inverse of the log transform has to be applied when evaluating the results.
[73]: std_scaler = StandardScaler().fit(X_train[numerical_features])
X_train_scaled = std_scaler.transform(X_train[numerical_features])
X_test_scaled = std_scaler.transform(X_test[numerical_features])
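# Assumed intermediate step, not shown in this excerpt: wrap the scaled arrays
# back into DataFrames so they can be recombined with the categorical columns below.
X_train_num = pd.DataFrame(X_train_scaled, columns=numerical_features, index=X_train.index)
X_test_num = pd.DataFrame(X_test_scaled, columns=numerical_features, index=X_test.index)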
X_train = pd.concat([X_train_num, X_train[categorical_features]], axis=1)
X_test = pd.concat([X_test_num, X_test[categorical_features]], axis=1)
[74]: X_train
[74]: X_test
PT08.S4(NO2) PT08.S5(O3) T RH AH Month Hour
3844 -0.258491 -1.807061 1.205387 -0.872496 0.781517 8 10
4191 0.505335 0.268576 1.273338 -1.241058 0.246674 9 18
8727 -0.203518 0.697546 0.650451 -0.918566 -0.098123 3 15
7890 -1.375297 -0.630784 -1.025682 -0.400276 -1.309233 2 17
6053 -0.510206 0.592964 -0.878454 1.788061 -0.006556 11 3
… … … … … … … …
350 0.762837 0.108974 -1.093634 1.459810 -0.456495 3 8
79 0.482189 1.184060 -0.425445 0.129532 -0.430086 3 1
8039 -0.709842 0.566114 -1.591943 1.707438 -0.902733 2 22
6936 0.317272 1.776868 -0.969056 0.953037 -0.570523 1 10
5640 -0.270064 0.126397 0.084190 0.572958 0.659344 11 22
[78]: descriptive_plots(X_train, include_target=False)
[79]: descriptive_plots(X_test, include_target=False)
• Target Feature Histograms
[80]: sns.histplot(y, kde=True)
plt.title("Transformed Target Feature Histogram")
plt.show()
[81]: sns.histplot(y_train, kde=True)
plt.title("Transformed Target Feature (Train) Histogram")
plt.show()
[82]: sns.histplot(y_test, kde=True)
plt.title("Transformed Target Feature (Test) Histogram")
plt.show()
2.4 Categorical Data Encoding
[83]: oh_encoder = OneHotEncoder(drop="first", sparse=False)
preprocessor = ColumnTransformer(
transformers=[
("cat", oh_encoder, categorical_features)
])
X_train_cat_enc = preprocessor.fit_transform(X_train)
X_test_cat_enc = preprocessor.transform(X_test)
oh_feature_names = preprocessor.named_transformers_["cat"].
↪get_feature_names_out(input_features=categorical_features)
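# Assumed intermediate step, not shown in this excerpt: wrap the encoded arrays
# into DataFrames with the generated column names so they can be concatenated
# with the numerical part (X_train_num / X_test_num) below.
X_train_cat_enc = pd.DataFrame(X_train_cat_enc, columns=oh_feature_names, index=X_train.index)
X_test_cat_enc = pd.DataFrame(X_test_cat_enc, columns=oh_feature_names, index=X_test.index)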
X_train = pd.concat([X_train_num, X_train_cat_enc], axis=1)
X_test = pd.concat([X_test_num, X_test_cat_enc], axis=1)
[84]: X_train.T
Hour_19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_20 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_21 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_22 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_19 0.000000 0.000000 0.000000 0.000000 … 0.000000
Hour_20 0.000000 1.000000 0.000000 0.000000 … 0.000000
Hour_21 0.000000 0.000000 0.000000 0.000000 … 0.000000
Hour_22 0.000000 0.000000 0.000000 0.000000 … 0.000000
Hour_23 0.000000 0.000000 0.000000 0.000000 … 0.000000
Hour_19 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_20 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_21 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_22 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_23 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Hour_19 0.000000 0.000000 0.000000
Hour_20 0.000000 0.000000 0.000000
Hour_21 0.000000 0.000000 0.000000
Hour_22 1.000000 0.000000 1.000000
Hour_23 0.000000 0.000000 0.000000
y_train = np.exp(y_train)
y_test = np.exp(y_test)
y_train_pred = np.exp(y_train_pred)
y_test_pred = np.exp(y_test_pred)
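These np.exp calls belong to the metrics helper used throughout the rest of the notebook, whose full definition is not reproduced above. A minimal sketch of such a print_metrics function, assuming it back-transforms the log target before computing the metrics, could look like this:
def print_metrics(model, X_train, X_test, y_train, y_test):
    # predictions are made in log space, then transformed back
    y_train_pred = np.exp(model.predict(X_train))
    y_test_pred = np.exp(model.predict(X_test))
    y_train, y_test = np.exp(y_train), np.exp(y_test)
    for name, y_true, y_pred in [("train", y_train, y_train_pred),
                                 ("test", y_test, y_test_pred)]:
        print(f"{name}: MAE={mae(y_true, y_pred):.4f}, "
              f"MSE={mse(y_true, y_pred):.4f}, R2={r2(y_true, y_pred):.4f}")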
[87]: def plot_error_distribution(model, X_train, X_test, y_train, y_test, model_name):
          # predict on the test set and undo the log transform of the target
          y_test_pred = np.exp(model.predict(X_test))
          errors = np.exp(y_test) - y_test_pred
          sns.scatterplot(x=y_test_pred, y=errors)
          plt.axhline(y=0, color="magenta")
          plt.xlabel("Predicted Values in Test Data")
          plt.ylabel("Errors")
          plt.title(f"Error Distribution for {model_name}")
          plt.grid()
          plt.show()
• model
[88]: reg_lin_model = LinearRegression()
reg_lin_model.fit(X_train, y_train)
[88]: LinearRegression()
• results
[89]: print_metrics(reg_lin_model, X_train, X_test, y_train, y_test)
[90]: np.exp(y_train).describe()
[91]: np.exp(y_test).describe()
[91]: count 1799.000000
mean 10.036298
std 7.588432
min 0.200000
25% 4.350000
50% 8.000000
75% 13.900000
max 50.800000
Name: C6H6(GT), dtype: float64
[93]: reg_ridge_model = Ridge(random_state=100)
reg_ridge_model.fit(X_train, y_train)
[93]: Ridge(random_state=100)
• results
[94]: print_metrics(reg_ridge_model, X_train, X_test, y_train, y_test)
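The cell that builds and fits the cross-validated ridge model is not shown; judging from the repr below and the analogous LassoCV cell further down, it presumably looked like this (the variable name alphas_ridge is assumed):
alphas_ridge = np.linspace(0.1, 100, 1000)
reg_ridge_cv = RidgeCV(alphas=alphas_ridge, cv=5)
reg_ridge_cv.fit(X_train, y_train)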
[95]: RidgeCV(alphas=array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9,
1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7,
2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6,
3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5,
4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4,
5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3,
6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2,
7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1,
8.2, 8.3, 8.4,…
92.8, 92.9, 93. , 93.1, 93.2, 93.3, 93.4, 93.5, 93.6,
93.7, 93.8, 93.9, 94. , 94.1, 94.2, 94.3, 94.4, 94.5,
94.6, 94.7, 94.8, 94.9, 95. , 95.1, 95.2, 95.3, 95.4,
95.5, 95.6, 95.7, 95.8, 95.9, 96. , 96.1, 96.2, 96.3,
96.4, 96.5, 96.6, 96.7, 96.8, 96.9, 97. , 97.1, 97.2,
97.3, 97.4, 97.5, 97.6, 97.7, 97.8, 97.9, 98. , 98.1,
98.2, 98.3, 98.4, 98.5, 98.6, 98.7, 98.8, 98.9, 99. ,
99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9,
100. ]),
cv=5)
• LassoCV
[99]: alphas_lasso = np.linspace(0.1, 100, 1000)
reg_lasso_cv = LassoCV(alphas=alphas_lasso, cv=5)
reg_lasso_cv.fit(X_train, y_train)
[99]: LassoCV(alphas=array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9,
1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7,
2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6,
3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5,
4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4,
5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3,
6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2,
7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1,
8.2, 8.3, 8.4,…
92.8, 92.9, 93. , 93.1, 93.2, 93.3, 93.4, 93.5, 93.6,
93.7, 93.8, 93.9, 94. , 94.1, 94.2, 94.3, 94.4, 94.5,
94.6, 94.7, 94.8, 94.9, 95. , 95.1, 95.2, 95.3, 95.4,
95.5, 95.6, 95.7, 95.8, 95.9, 96. , 96.1, 96.2, 96.3,
96.4, 96.5, 96.6, 96.7, 96.8, 96.9, 97. , 97.1, 97.2,
97.3, 97.4, 97.5, 97.6, 97.7, 97.8, 97.9, 98. , 98.1,
98.2, 98.3, 98.4, 98.5, 98.6, 98.7, 98.8, 98.9, 99. ,
99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9,
100. ]),
cv=5)
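The best regularization strength referenced in the next cell, alpha_lasso_cv_best, is presumably read from the fitted estimator, which exposes it as the alpha_ attribute:
alpha_lasso_cv_best = reg_lasso_cv.alpha_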
[102]: plot_error_distribution(reg_lasso_cv, X_train, X_test, y_train, y_test,
f"LassoCV (alpha={alpha_lasso_cv_best})")
• ElasticNetCV
[103]: alphas_elastic = np.linspace(0.1, 20, 100)
l1_ratios = np.linspace(0.1, 1, 100)
reg_elnet_cv = ElasticNetCV(alphas=alphas_elastic, l1_ratio=l1_ratios, cv=5)
reg_elnet_cv.fit(X_train, y_train)
0.78181818, 0.79090909, 0.8 , 0.80909091, 0.81818182,
0.82727273, 0.83636364, 0.84545455, 0.85454545, 0.86363636,
0.87272727, 0.88181818, 0.89090909, 0.9 , 0.90909091,
0.91818182, 0.92727273, 0.93636364, 0.94545455, 0.95454545,
0.96363636, 0.97272727, 0.98181818, 0.99090909, 1. ]))
6 Polynomial Regression
[107]: poly_features = PolynomialFeatures(degree=2, interaction_only=True)
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)
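The cells that fit and evaluate models on these polynomial features are not reproduced above; presumably they mirror the earlier pattern, e.g. (the name reg_lin_poly is hypothetical):
reg_lin_poly = LinearRegression()
reg_lin_poly.fit(X_train_poly, y_train)
print_metrics(reg_lin_poly, X_train_poly, X_test_poly, y_train, y_test)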
6.13…
0.69090909, 0.7 , 0.70909091, 0.71818182, 0.72727273,
0.73636364, 0.74545455, 0.75454545, 0.76363636, 0.77272727,
0.78181818, 0.79090909, 0.8 , 0.80909091, 0.81818182,
0.82727273, 0.83636364, 0.84545455, 0.85454545, 0.86363636,
0.87272727, 0.88181818, 0.89090909, 0.9 , 0.90909091,
0.91818182, 0.92727273, 0.93636364, 0.94545455, 0.95454545,
0.96363636, 0.97272727, 0.98181818, 0.99090909, 1. ]))
7 Feature Importance
[112]: def plot_feature_importance(model, head, model_name):
columns = X_train.columns
importance_values = np.abs(model.coef_)
indices = np.argsort(importance_values)[::-1][:head]
plt.figure(figsize=(10, 7))
sns.barplot(x=np.array(columns)[indices], y=importance_values[indices])
plt.xticks(rotation=90)
plt.xlabel("Feature Name")
plt.ylabel("Feature Importance")
plt.title(f"Top most important features for {model_name}")
plt.tight_layout()
plt.show()
8 Additional Steps: Remove the Most Important Feature ‘PT08.S2(NMHC)’
[114]: X_train.pop("PT08.S2(NMHC)")
[115]: X_test.pop("PT08.S2(NMHC)")
[115]: 8448 -1.206424
5948 1.616440
5053 -0.201218
4839 1.134703
8976 -0.983151
…
3923 0.091460
4876 -0.686929
5733 1.025831
4214 0.431246
4017 0.287059
Name: PT08.S2(NMHC), Length: 1799, dtype: float64
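The refit cells themselves are not reproduced; based on the outputs below, the previously defined models are simply refit on the reduced feature set, e.g.:
reg_lin_model.fit(X_train, y_train)
print_metrics(reg_lin_model, X_train, X_test, y_train, y_test)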
[116]: LinearRegression()
[119]: RidgeCV(alphas=array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9,
1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7,
2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6,
3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5,
4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4,
5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3,
6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2,
7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1,
8.2, 8.3, 8.4,…
92.8, 92.9, 93. , 93.1, 93.2, 93.3, 93.4, 93.5, 93.6,
93.7, 93.8, 93.9, 94. , 94.1, 94.2, 94.3, 94.4, 94.5,
94.6, 94.7, 94.8, 94.9, 95. , 95.1, 95.2, 95.3, 95.4,
95.5, 95.6, 95.7, 95.8, 95.9, 96. , 96.1, 96.2, 96.3,
96.4, 96.5, 96.6, 96.7, 96.8, 96.9, 97. , 97.1, 97.2,
97.3, 97.4, 97.5, 97.6, 97.7, 97.8, 97.9, 98. , 98.1,
98.2, 98.3, 98.4, 98.5, 98.6, 98.7, 98.8, 98.9, 99. ,
99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9,
100. ]),
cv=5)
8.3 Model With Lasso Regularization and Cross-Validation
[122]: reg_lasso_cv.fit(X_train, y_train)
[122]: LassoCV(alphas=array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9,
1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8,
1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7,
2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6,
3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5,
4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4,
5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3,
6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2,
7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1,
8.2, 8.3, 8.4,…
92.8, 92.9, 93. , 93.1, 93.2, 93.3, 93.4, 93.5, 93.6,
93.7, 93.8, 93.9, 94. , 94.1, 94.2, 94.3, 94.4, 94.5,
94.6, 94.7, 94.8, 94.9, 95. , 95.1, 95.2, 95.3, 95.4,
95.5, 95.6, 95.7, 95.8, 95.9, 96. , 96.1, 96.2, 96.3,
96.4, 96.5, 96.6, 96.7, 96.8, 96.9, 97. , 97.1, 97.2,
97.3, 97.4, 97.5, 97.6, 97.7, 97.8, 97.9, 98. , 98.1,
98.2, 98.3, 98.4, 98.5, 98.6, 98.7, 98.8, 98.9, 99. ,
99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9,
100. ]),
cv=5)
8.4 Model With ElasticNet Regularization and Cross-Validation
[125]: reg_elnet_cv.fit(X_train, y_train)
[126]: print_metrics(reg_elnet_cv, X_train, X_test, y_train, y_test)
8.5 Feature Importance Without the ‘PT08.S2(NMHC)’ Feature
[128]: plot_feature_importance(reg_lin_model, 20, "Base Linear Regression w/o PT08.
↪S2(NMHC)")