Regression and EDA
May 3, 2024
[3]: import pandas as pd

df = pd.read_excel('C:\\Users\\lariy\\Downloads\\capstone\\final_data.xlsx')
df
[4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2244 entries, 0 to 2243
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 0 non-null float64
1 Unnamed: 1 0 non-null float64
2 Year 2244 non-null datetime64[ns]
3 Gold Prices (USD) 2244 non-null float64
4 Inflation 2192 non-null float64
5 Unemployment rate 2136 non-null float64
6 Interest rates 2244 non-null float64
7 Oil Prices (USD) 2244 non-null float64
dtypes: datetime64[ns](1), float64(7)
memory usage: 140.4 KB
[7]: df.describe()
[27]: df.columns
print(df.head())
Year Gold Prices (USD) Inflation Unemployment rate Interest rates \
0 1980-01-04 588.0 13.549202 6.3 18.9
1 1980-01-11 623.0 13.549202 6.3 18.9
2 1980-01-18 835.0 13.549202 6.3 18.9
3 1980-01-25 668.0 13.549202 6.3 18.9
4 1980-02-01 676.5 13.549202 6.9 18.9
[10]: # Removing missing values from "Inflation" and "Unemployment rate"
# Drop the empty "Unnamed" columns, then remove rows with missing values in the
# 'Inflation' and 'Unemployment rate' columns
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 1'])
df = df.dropna(subset=['Inflation', 'Unemployment rate']).reset_index(drop=True)
[12]: df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2136 entries, 0 to 2135
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 2136 non-null datetime64[ns]
1 Gold Prices (USD) 2136 non-null float64
2 Inflation 2136 non-null float64
3 Unemployment rate 2136 non-null float64
4 Interest rates 2136 non-null float64
5 Oil Prices (USD) 2136 non-null float64
dtypes: datetime64[ns](1), float64(5)
memory usage: 116.8 KB
import matplotlib.pyplot as plt
import seaborn as sns

plt.subplot(3, 2, 4)
sns.histplot(df['Unemployment rate'], kde=True)
plt.title('Distribution of Unemployment rate')
plt.tight_layout()
plt.show()
sns.set(style="whitegrid")
plt.tight_layout()
plt.show()
[17]: import seaborn as sns
import matplotlib.pyplot as plt
plt.show()
[18]: import seaborn as sns
import matplotlib.pyplot as plt
[ ]: '''Theoretically, the relationship between inflation and gold prices is
positive, but according to the correlation matrix it is negative. This may be
due to external factors such as government policies. The relationship is also
not very strong.'''
def detect_outliers_iqr(data, threshold=1.5):
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    return (data < lower_bound) | (data > upper_bound)
Outliers in Year:
Outliers:
0       False
1       False
        ...
2134    False
2135    False
Name: Year, Length: 2136, dtype: bool
IQR Outliers:
0       False
1       False
        ...
2134    False
2135    False
Name: Year, Length: 2136, dtype: bool
Outliers in Inflation:
Outliers:
0        True
1        True
         ...
2134    False
2135    False
Name: Inflation, Length: 2136, dtype: bool
IQR Outliers:
0        True
1        True
         ...
2134    False
2135    False
Name: Inflation, Length: 2136, dtype: bool
Outliers in Interest rates:
Outliers:
0        True
1        True
         ...
2134    False
2135    False
Name: Interest rates, Length: 2136, dtype: bool
IQR Outliers:
0        True
1        True
         ...
2134    False
2135    False
Name: Interest rates, Length: 2136, dtype: bool
Outliers in Oil Prices (USD):
...
Name: Oil Prices (USD), Length: 2136, dtype: bool
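As a compact alternative to printing the full boolean Series above, the IQR helper can report a per-column outlier count. A self-contained sketch with toy values (not the project data; `df_demo` is a hypothetical stand-in):

```python
import pandas as pd

def detect_outliers_iqr(data, threshold=1.5):
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    return (data < q1 - threshold * iqr) | (data > q3 + threshold * iqr)

# Toy frame standing in for the cleaned df (hypothetical values)
df_demo = pd.DataFrame({'Inflation': [2.0, 2.1, 2.2, 2.0, 13.5],
                        'Interest rates': [3.0, 3.1, 2.9, 3.0, 18.9]})

# One flagged row per column: the extreme 1980-style values
counts = {col: int(detect_outliers_iqr(df_demo[col]).sum()) for col in df_demo.columns}
print(counts)
```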
# Iterate over each column in the dataframe and plot box plots with outliers
for i, column in enumerate(df.columns):
    plt.subplot(len(df.columns)//2, 2, i+1)
    sns.boxplot(x=df[column], showfliers=True)
    plt.title(f'Box Plot of {column}')

# Adjust layout
plt.tight_layout()
plt.show()
# Define a function to detect outliers using z-score
def detect_outliers_zscore(data, threshold=3):
    z_scores = (data - data.mean()) / data.std()
    return (z_scores > threshold) | (z_scores < -threshold)
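A quick sanity check of this helper on a hypothetical Series (the notebook applies it to each column of `df` and prints counts like those shown below):

```python
import pandas as pd

def detect_outliers_zscore(data, threshold=3):
    z_scores = (data - data.mean()) / data.std()
    return (z_scores > threshold) | (z_scores < -threshold)

s = pd.Series([0.0] * 99 + [100.0])        # one extreme value among 100
print(int(detect_outliers_zscore(s).sum()))  # → 1
```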
Outlier counts in Inflation:
Z-Score Outliers Count: 104
IQR Outliers Count: 261
# Iterate over each column in the dataframe to detect and store outliers
outliers = {}
for column in df.columns:
    # Detect outliers using z-score and IQR methods
    zscore_outliers = df[column][detect_outliers_zscore(df[column])]
    iqr_outliers = df[column][detect_outliers_iqr(df[column])]
    outliers[column] = {'zscore_outliers': zscore_outliers, 'iqr_outliers': iqr_outliers}
# Display the outlier values for each variable
for column, values in outliers.items():
    print(f"Outliers in {column}:")
    print("Z-Score Outliers:")
    print(values['zscore_outliers'])
    print("IQR Outliers:")
    print(values['iqr_outliers'])
    print()
Outliers in Year:
Z-Score Outliers:
Series([], Name: Year, dtype: datetime64[ns])
IQR Outliers:
Series([], Name: Year, dtype: datetime64[ns])
Outliers in Inflation:
Z-Score Outliers:
0 13.549202
1 13.549202
2 13.549202
3 13.549202
4 13.549202
…
99 10.334715
100 10.334715
101 10.334715
102 10.334715
103 10.334715
Name: Inflation, Length: 104, dtype: float64
IQR Outliers:
0 13.549202
1 13.549202
2 13.549202
3 13.549202
4 13.549202
…
1821 0.118627
1822 0.118627
1823 0.118627
1824 0.118627
1825 0.118627
Name: Inflation, Length: 261, dtype: float64
Outliers in Interest rates:
Z-Score Outliers:
0     18.9
1     18.9
      ...
50    18.9
51    18.9
Name: Interest rates, dtype: float64
IQR Outliers:
0     18.9
1     18.9
      ...
50    18.9
51    18.9
Name: Interest rates, dtype: float64
[ ]: '''It can be seen that there are many outliers, but they share historical
events. Since the outliers are considered meaningful reflections of historical
conditions, they are retained rather than removed.'''
# Assuming your dataframe is named 'df' with the independent variables as columns
Correlation Matrix:
Year Gold Prices (USD) Inflation Unemployment rate \
Year 1.000000 0.775752 -0.532349 -0.501394
Gold Prices (USD) 0.775752 1.000000 -0.205732 -0.014243
Inflation -0.532349 -0.205732 1.000000 0.277274
Unemployment rate -0.501394 -0.014243 0.277274 1.000000
Interest rates -0.841198 -0.560575 0.772618 0.399629
Oil Prices (USD) 0.678407 0.772340 -0.298871 0.036252
# Assuming your dataframe is named 'df' with the independent variables as columns
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if threshold_min <= abs(correlation_matrix.iloc[i, j]) <= threshold_max:
            high_correlation_pairs.append((correlation_matrix.columns[i],
                                           correlation_matrix.columns[j],
                                           correlation_matrix.iloc[i, j]))
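The loop above assumes `correlation_matrix`, `threshold_min`, `threshold_max`, and `high_correlation_pairs` were defined in a lost cell. A self-contained sketch of the same pair-scan on synthetic stand-in data (`demo` and the thresholds here are illustrative, not the project's values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({'a': x,
                     'b': x + rng.normal(scale=0.1, size=200),  # strongly correlated with 'a'
                     'c': rng.normal(size=200)})                # independent noise

correlation_matrix = demo.corr()
threshold_min, threshold_max = 0.7, 1.0   # band of |r| values to flag
high_correlation_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        r = correlation_matrix.iloc[i, j]
        if threshold_min <= abs(r) <= threshold_max:
            high_correlation_pairs.append((correlation_matrix.columns[i],
                                           correlation_matrix.columns[j], r))

print(high_correlation_pairs)   # only the ('a', 'b') pair is flagged
```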
[ ]: #PREPROCESSING IS DONE
[31]: # Save the DataFrame with Year and Gold Prices to a CSV file (time series)
df.to_csv("data_revised.csv", index=False)
[43]: #PCA
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Since PCA does not require the time variable, we drop "Year"
data = df.drop(columns=['Year'])
[44]: data
2135 1843.0 4.697859 4.6 0.09
[47]: scaled_data
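The cell that produced `scaled_data` was lost from the transcript; presumably it standardized the numeric frame with the `StandardScaler` imported above. A minimal sketch with toy values (`data_demo` is a hypothetical stand-in for `data`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for `data` (the real frame has five numeric columns)
data_demo = pd.DataFrame({'Gold Prices (USD)': [588.0, 623.0, 835.0, 668.0],
                          'Inflation': [13.5, 13.5, 13.5, 10.3]})

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_demo)  # each column -> mean 0, unit variance
print(scaled_data.shape)
```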
Explained Variance Ratio:
[0.53362372 0.26151325 0.13667759 0.0456746 0.02251084]
plt.title('Scree Plot')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.grid(True)
plt.legend()
plt.show()
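A self-contained sketch of the PCA fit behind the scree plot above (a random array stands in for `scaled_data`; the real run produced the ratios printed earlier):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
scaled_demo = rng.normal(size=(100, 5))   # stand-in for scaled_data (5 features)

pca = PCA(n_components=5)
pca.fit(scaled_demo)
evr = pca.explained_variance_ratio_       # sums to 1 when all components are kept
print("Explained Variance Ratio:", evr)
# The scree plot is simply evr against the component index:
# plt.plot(range(1, 6), evr, marker='o')
```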
Elbow Point (k): 1
[53]: '''However, if the scree plot exhibits a downward slope and there is no
clear elbow point, it may imply that PCA is not suitable for dimensionality
reduction in this dataset.'''
[ ]: ## MODEL BUILDING
# Since we have multicollinearity in the data, it is better to use Lasso
# Regression. But for exploration purposes, we perform both.
[ ]: ---------------------------------------------------------------------------------------------
[67]: # Assuming X_train, X_test, y_train, y_test are your training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
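A minimal end-to-end sketch of the split/fit/score pattern this cell sets up, on synthetic stand-in data (`X`, `y`, and the coefficients are illustrative). Note that `r2_score` only works when the predictions come from the same split as `y_test`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))                                   # stand-in features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
linear_reg = LinearRegression().fit(X_train, y_train)
y_pred_linear = linear_reg.predict(X_test)   # predictions aligned with y_test
r2_linear = r2_score(y_test, y_pred_linear)
print("Linear Regression R-squared:", r2_linear)
```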
[127]: df.dtypes
[135]: #Accuracy
from sklearn.metrics import r2_score
r2_linear = r2_score(y_test, y_pred_linear)
print("Linear Regression R-squared:", r2_linear)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[135], line 3
      1 #Accuracy
      2 from sklearn.metrics import r2_score
----> 3 r2_linear = r2_score(y_test, y_pred_linear)
      4 print("Linear Regression R-squared:", r2_linear)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
--> 213     return func(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.py:1180, in r2_score(y_true, y_pred, sample_weight, multioutput, force_finite)
-> 1180 y_type, y_true, y_pred, multioutput = _check_reg_targets(
   1181     y_true, y_pred, multioutput
   1182 )
   1183 check_consistent_length(y_true, y_pred, sample_weight)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.py:102, in _check_reg_targets(y_true, y_pred, multioutput, dtype)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:457, in check_consistent_length(*arrays)
[ ]: ----------------------------------------------------------------------------------------------
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error

lasso_reg = Lasso()
parameters = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
lasso_grid = GridSearchCV(lasso_reg, parameters, scoring='neg_mean_squared_error', cv=5)
lasso_grid.fit(X_train, y_train)
y_pred_lasso = lasso_grid.predict(X_test)
rmse_lasso = mean_squared_error(y_test, y_pred_lasso, squared=False)
print("Lasso Regression RMSE:", rmse_lasso)
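Stitched together on synthetic stand-in data, the cell above runs end-to-end like this sketch (`X`, `y`, and the coefficients are illustrative; `squared=False` is deprecated in scikit-learn 1.4+, so the RMSE is taken via `np.sqrt`):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
lasso_grid = GridSearchCV(Lasso(), {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]},
                          scoring='neg_mean_squared_error', cv=5)
lasso_grid.fit(X_train, y_train)
y_pred_lasso = lasso_grid.predict(X_test)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
r2_lasso = r2_score(y_test, y_pred_lasso)   # works: same split as y_test
print("Lasso RMSE:", rmse_lasso, "R^2:", r2_lasso)
```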
[120]:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[120], line 1
----> 1 r2_lasso = r2_score(y_test, y_pred_lasso)
      2 print("Lasso Regression R-squared:", r2_lasso)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
--> 213     return func(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.py:1180, in r2_score(y_true, y_pred, sample_weight, multioutput, force_finite)
-> 1180 y_type, y_true, y_pred, multioutput = _check_reg_targets(
   1181     y_true, y_pred, multioutput
   1182 )
   1183 check_consistent_length(y_true, y_pred, sample_weight)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.py:102, in _check_reg_targets(y_true, y_pred, multioutput, dtype)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:457, in check_consistent_length(*arrays)
[ ]: ----------------------------------------------------------------------------------------------
[ ]: #CROSS VALIDATION
# Assuming X_train, X_test, y_train, y_test are your training and testing data
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
test_preds = lasso.predict(X_test)
# Perform cross-validation
cv_scores = cross_val_score(lasso, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
C:\Users\lariy\AppData\Local\Temp\ipykernel_17032\3438038971.py:6: UserWarning:
color is redundantly defined by the 'color' keyword argument and the fmt string
"k--" (-> color='k'). The keyword argument will take precedence.
  plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--',
           lw=2, color='red', label='Perfect Prediction')
[ ]: #CALCULATING RESIDUALS
[ ]: #checking assumptions
# Calculate residuals
residuals = y - predicted_values
[79]: from scipy.stats import shapiro

shapiro_test_statistic, shapiro_p_value = shapiro(residuals)
print("Shapiro-Wilk Test:")
print("Test Statistic:", shapiro_test_statistic)
print("p-value:", shapiro_p_value)
if shapiro_p_value > 0.05:
    print("p-value > 0.05. Fail to reject the null hypothesis. Residuals are normally distributed.")
else:
    print("p-value <= 0.05. Reject the null hypothesis. Residuals are not normally distributed.")
Shapiro-Wilk Test:
Test Statistic: 0.5694072286447804
p-value: 2.730521207845291e-58
p-value <= 0.05. Reject the null hypothesis. Residuals are not normally
distributed.
Breusch-Pagan Test:
Test Statistic: 1100.672462731512
p-value: 2.3590049793082233e-241
[82]: import numpy as np
from scipy.stats import skew
# Plot for 'Oil Prices (USD)' variable
plt.subplot(3, 2, 6)
sns.histplot(data['Oil Prices (USD)'], kde=True)
plt.title('Distribution of Oil Prices (USD)')
plt.tight_layout()
plt.show()
print("Shapiro-Wilk Test:")
print("Test Statistic:", shapiro_test_statistic)
print("p-value:", shapiro_p_value)
if shapiro_p_value > 0.05:
    print("p-value > 0.05. Fail to reject the null hypothesis. Residuals are normally distributed.")
else:
    print("p-value <= 0.05. Reject the null hypothesis. Residuals are not normally distributed.")
Shapiro-Wilk Test:
Test Statistic: 0.45959720762702283
p-value: 2.042106007386736e-62
p-value <= 0.05. Reject the null hypothesis. Residuals are not normally
distributed.
Breusch-Pagan Test:
Test Statistic: 1100.672462731512
p-value: 2.3590049793082233e-241
[ ]: ------------------------------------------------------------------
import statsmodels.api as sm

model = sm.OLS(y, X).fit()
# Calculate residuals
residuals = model.resid
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 1.48e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
[ ]: #CROSS VALIDATION
# Assuming X_train and y_train are your training data and labels
# Instantiate the model
model = LinearRegression()
# Perform cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # adjust the number of folds (cv) as needed
print("Training R-squared:", training_score)
print("Testing R-squared:", testing_score)
[ ]: ----------------------------------------------------------------------------------------------
[ ]: #TRYING MODELS
[ ]: #DECISION TREE
# Calculate Root Mean Squared Error (RMSE)
rmse_dt = np.sqrt(mse_dt)
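The tree-fitting code was lost from the transcript; a self-contained sketch of the pattern on synthetic stand-in data (the names mirror the fragment above; `X`, `y` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dt_model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
mse_dt = mean_squared_error(y_test, dt_model.predict(X_test))
# Calculate Root Mean Squared Error (RMSE)
rmse_dt = np.sqrt(mse_dt)
print("Decision Tree RMSE:", rmse_dt)
```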
[ ]: ----------------------------------------------------------------------------------------------
[ ]: #RANDOM FOREST
[ ]: ----------------------------------------------------------------------------------------------
[ ]: #GRADIENT BOOSTING
# Make predictions on training and testing data
y_train_pred_gb = gb_model.predict(X_train)
y_test_pred_gb = gb_model.predict(X_test)
# Assuming X_train, X_test, y_train, y_test are your training and testing data
# 1. Linear Regression
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
# Calculate RMSE
y_pred_train_lr = linear_reg.predict(X_train)
rmse_train_lr = np.sqrt(mean_squared_error(y_train, y_pred_train_lr))
y_pred_test_lr = linear_reg.predict(X_test)
rmse_test_lr = np.sqrt(mean_squared_error(y_test, y_pred_test_lr))
# Cross-validation
cv_scores_lr = cross_val_score(linear_reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_lr = np.sqrt(-cv_scores_lr)
# Print results
print("Linear Regression:")
print("Training RMSE:", rmse_train_lr)
print("Testing RMSE:", rmse_test_lr)
print("Mean CV RMSE:", cv_rmse_lr.mean())
print("Std CV RMSE:", cv_rmse_lr.std())
Linear Regression:
Training RMSE: 2.2829534657510607e-13
Testing RMSE: 2.3371014992128577e-13
Mean CV RMSE: 2.4273458808647473e-13
Std CV RMSE: 9.99530028073763e-14
[124]:
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:
170 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to
nan.
If these failures are not expected, you can try to debug them by setting
error_score='raise'.
validate_parameter_constraints(
File "C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\utils\_param_validation.py", line 95, in
validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features'
parameter of RandomForestRegressor must be an int in the range [1, inf), a float
in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto'
instead.
warnings.warn(some_fits_failed_message, FitFailedWarning)
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of
the test scores are non-finite: [ nan -58.93907181 -37.90943391
-39.08803166 -43.19203625
-41.3236575 nan nan nan -57.15259269
-9.60570762 -40.55572985 nan -58.59893245 nan
-73.39584413 -18.16182559 nan -11.13065831 -49.6359911
nan -48.98262037 -11.31604215 -11.67992611 -48.91244679
-15.17315845 -27.45118465 nan -38.86970611 nan
-46.46008978 -76.8895055 -58.00598485 nan nan
-44.38463655 -11.17309129 nan -47.82243001 -43.27158338
-74.06200261 nan nan -17.23503776 -25.69643983
-60.15686948 nan -79.96635582 -27.44276843 -62.70058757
-58.44250228 nan -64.96921744 nan -63.50868913
-64.67395609 -70.42213274 -50.12022647 nan -27.87982487
nan -71.32384802 -25.74420683 nan nan
-73.91215454 -62.70968467 -38.69771057 -71.87860269 nan
-47.75998313 nan -57.17101105 -69.30109927 -27.04269829
nan -21.32999161 nan -11.68845447 -71.50186947
-58.46591787 -63.41533483 nan -25.82496422 -54.44954262
nan -25.74575808 -33.52783331 -32.28189951 nan
-71.64371947 -42.34704253 nan -4.80991776 nan
nan -25.35147124 -79.21123837 nan nan]
warnings.warn(
Best Hyperparameters: {'max_depth': 30, 'max_features': 'log2',
'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 123}
Training RMSE: 0.7535148520604379
Testing RMSE: 1.1135615099471048
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(
# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf_regressor, param_distributions=param_dist,
                                   n_iter=100, cv=5, scoring='neg_mean_squared_error',
                                   random_state=42)
random_search.fit(X_train, y_train)
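The repeated FitFailedWarning comes from including `'auto'` in the `max_features` grid: scikit-learn deprecated that alias for `RandomForestRegressor` in 1.1 and removed it in 1.3. A valid grid that avoids the failed fits (the exact original `param_dist` is not shown in the transcript, so the other values here are illustrative, chosen to be consistent with the reported best hyperparameters):

```python
# 'auto' is no longer accepted for max_features; valid values are an int,
# a float in (0.0, 1.0], 'sqrt', 'log2', or None.
param_dist = {
    'n_estimators': list(range(50, 201)),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],   # no 'auto'
}
print(sorted(param_dist))
```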
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:
170 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to
nan.
If these failures are not expected, you can try to debug them by setting
error_score='raise'.
warnings.warn(some_fits_failed_message, FitFailedWarning)
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of
the test scores are non-finite: [ nan -58.46487472 -37.58114991
-42.56400437 -43.44029062
-42.10161918 nan nan nan -57.94417192
-9.16763144 -36.80911199 nan -61.16860533 nan
-72.77088008 -17.58202262 nan -11.41052383 -47.83249256
nan -49.86612921 -11.77792633 -12.80862819 -48.15975327
-16.30022891 -26.89035096 nan -36.89072673 nan
-50.18467986 -75.53246766 -60.59819993 nan nan
-43.91990928 -11.54445493 nan -49.10973958 -45.45312132
-71.05456128 nan nan -16.48893719 -25.75054451
-60.00189991 nan -70.68815294 -27.23028137 -64.0287326
-56.70982844 nan -63.52681669 nan -63.1903081
-61.92191672 -70.84075316 -54.15822214 nan -27.03075124
nan -72.24201098 -27.96395719 nan nan
-73.31688285 -59.45094749 -37.26006078 -71.19256996 nan
-49.74931541 nan -56.2931076 -70.51918141 -26.51741668
nan -21.61383175 nan -11.67718781 -72.78182952
-57.23201728 -62.8046166 nan -27.12209161 -51.13025851
nan -27.13044707 -32.43912226 -32.90987822 nan
-69.76462899 -42.51182264 nan -4.74511793 nan
nan -26.98026314 -77.91433524 nan nan]
warnings.warn(
Best Hyperparameters: {'max_depth': 30, 'max_features': 'log2',
'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 123}
Training RMSE: 0.6656633581584913
Testing RMSE: 1.1766919161273477
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(
C:\Users\lariy\AppData\Local\Programs\Python\Python310\lib\site-
packages\sklearn\metrics\_regression.py:483: FutureWarning: 'squared' is
deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean
squared error, use the function'root_mean_squared_error'.
warnings.warn(