
Data final regression

December 11, 2023

1 Panel OLS
[32]: import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.read_csv('/Users/isaacbougherab/Desktop/nour1_1.csv')

data_reset = data.reset_index()

formula = ('SP ~ ROA + Leverage + div + independecy + gender_composition'
           ' + size + Q("board size") + C(company_id) + C(year_id)')

model = smf.ols(formula, data=data_reset).fit()

print(model.summary())

OLS Regression Results


==============================================================================
Dep. Variable: SP R-squared: 0.836
Model: OLS Adj. R-squared: 0.808
Method: Least Squares F-statistic: 29.75
Date: Mon, 11 Dec 2023 Prob (F-statistic): 1.08e-84
Time: 17:18:05 Log-Likelihood: -2323.8
No. Observations: 329 AIC: 4746.
Df Residuals: 280 BIC: 4932.
Df Model: 48
Covariance Type: nonrobust
=====================================================================================
                          coef    std err          t      P>|t|     [0.025     0.975]
-------------------------------------------------------------------------------------
Intercept           -2568.4228    989.657     -2.595      0.010  -4516.535   -620.310
C(company_id)[T.2]    676.2726    319.952      2.114      0.035     46.455   1306.090
C(company_id)[T.3]   1696.7247    228.659      7.420      0.000   1246.616   2146.833
C(company_id)[T.4]    959.7391    243.509      3.941      0.000    480.398   1439.080
C(company_id)[T.5]    682.1991    321.635      2.121      0.035     49.069   1315.330
C(company_id)[T.6]   1784.4294    200.859      8.884      0.000   1389.044   2179.815
C(company_id)[T.7]    579.2368    199.973      2.897      0.004    185.595    972.878
C(company_id)[T.8]    726.8245    211.744      3.433      0.001    310.012   1143.637
C(company_id)[T.9]    431.4618    208.545      2.069      0.039     20.946    841.977
C(company_id)[T.10]   980.8839    323.823      3.029      0.003    343.448   1618.320
C(company_id)[T.11]   747.9023    409.535      1.826      0.069    -58.256   1554.061
C(company_id)[T.12]    71.7132    449.902      0.159      0.873   -813.907    957.333
C(company_id)[T.13]   526.5121    242.000      2.176      0.030     50.143   1002.882
C(company_id)[T.14]   252.8241    192.842      1.311      0.191   -126.781    632.429
C(company_id)[T.15]   792.2757    384.911      2.058      0.040     34.589   1549.962
C(company_id)[T.16]  1265.7232    583.550      2.169      0.031    117.021   2414.425
C(company_id)[T.17]   802.2170    420.656      1.907      0.058    -25.833   1630.267
C(company_id)[T.18]  1032.9546    207.091      4.988      0.000    625.303   1440.607
C(company_id)[T.19]   501.1618    278.328      1.801      0.073    -46.719   1049.042
C(company_id)[T.20]   895.4253    218.496      4.098      0.000    465.322   1325.529
C(company_id)[T.21]   559.2011    186.371      3.000      0.003    192.334    926.068
C(company_id)[T.22]   539.3729    192.305      2.805      0.005    160.826    917.919
C(company_id)[T.23]   581.4894    232.680      2.499      0.013    123.466   1039.513
C(company_id)[T.24]  1280.7350    308.707      4.149      0.000    673.054   1888.416
C(company_id)[T.25]   217.5453    200.044      1.087      0.278   -176.236    611.327
C(company_id)[T.26]    96.4090    185.117      0.521      0.603   -267.989    460.807
C(company_id)[T.27]   392.1751    184.590      2.125      0.034     28.815    755.535
C(company_id)[T.28]  3252.6314    231.511     14.050      0.000   2796.908   3708.355
C(company_id)[T.29]  1101.5438    306.117      3.598      0.000    498.962   1704.126
C(company_id)[T.30]   177.4431    171.402      1.035      0.301   -159.957    514.843
C(company_id)[T.31]   857.0456    379.330      2.259      0.025    110.345   1603.746
C(company_id)[T.32]   -31.4485    320.476     -0.098      0.922   -662.297    599.400
C(company_id)[T.33]   517.0795    272.877      1.895      0.059    -20.072   1054.231
C(year_id)[T.2014]    -35.3306     76.691     -0.461      0.645   -186.294    115.633
C(year_id)[T.2015]      4.8012     76.507      0.063      0.950   -145.800    155.403
C(year_id)[T.2016]     42.2214     76.462      0.552      0.581   -108.291    192.734
C(year_id)[T.2017]    129.3250     76.690      1.686      0.093    -21.637    280.287
C(year_id)[T.2018]    151.9537     78.648      1.932      0.054     -2.863    306.771
C(year_id)[T.2019]    159.2207     78.128      2.038      0.042      5.428    313.013
C(year_id)[T.2020]    181.6642     80.725      2.250      0.025     22.759    340.569
C(year_id)[T.2021]    192.7232     82.584      2.334      0.020     30.159    355.288
C(year_id)[T.2022]    180.6225     85.513      2.112      0.036     12.293    348.952
ROA                    -1.6138      1.592     -1.014      0.312     -4.747      1.520
Leverage               -0.8246      0.162     -5.082      0.000     -1.144     -0.505
div                   161.1431     92.554      1.741      0.083    -21.047    343.333
independecy            -1.5112      1.548     -0.976      0.330     -4.559      1.537
gender_composition     14.7329      3.572      4.125      0.000      7.702     21.763
size                  315.0929    115.371      2.731      0.007     87.989    542.197
Q("board size")       -28.5152     24.609     -1.159      0.248    -76.957     19.926
==============================================================================
Omnibus:                      160.132   Durbin-Watson:                   0.637
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8650.525
Skew:                          -1.185   Prob(JB):                         0.00
Kurtosis:                      28.008   Cond. No.                     3.85e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The condition number is large, 3.85e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
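The large condition number flagged in note [2] can be inspected directly. A minimal sketch, assuming the fitted model from the cell above is still in scope; it recomputes the condition number of the design matrix that statsmodels stored, and should roughly reproduce the 3.85e+04 reported above:

[ ]: import numpy as np

# Condition number of the design matrix: the ratio of its largest to
# smallest singular value. Values in the 1e3+ range are commonly read as a
# warning sign for multicollinearity or poor scaling.
print(np.linalg.cond(model.model.exog))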

1.1 Significance figure


[23]: import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

specified_variables = ['ROA', 'Leverage', 'div', 'independecy',
                       'gender_composition', 'size', 'Q("board size")']

p_values = model.pvalues

filtered_p_values = p_values[specified_variables]
# Take the names from the filtered Series itself so they stay aligned with
# the p-values (Index.intersection does not guarantee this ordering).
filtered_names = filtered_p_values.index

significance_dict = {0.001: '***', 0.01: '**', 0.05: '*', 0.1: '.'}

significance_labels = [next((symbol for threshold, symbol
                             in significance_dict.items() if p < threshold), '')
                       for p in filtered_p_values]

sorted_indices = np.argsort(filtered_p_values.values)
sorted_p_values = filtered_p_values.iloc[sorted_indices]
sorted_names = filtered_names[sorted_indices]
sorted_significance = [significance_labels[i] for i in sorted_indices]

plt.figure(figsize=(8, 6))
bars = plt.bar(sorted_names, sorted_p_values, color='skyblue')
plt.xticks(rotation=45)
plt.xlabel('Variables')
plt.ylabel('p-values')
plt.title('P-values and Significance Levels')
plt.axhline(y=0.05, color='r', linestyle='--', linewidth=0.7)

for bar, label in zip(bars, sorted_significance):
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2.0, yval, label, ha='center',
             va='bottom', fontsize=8)

plt.tight_layout()
plt.show()

1.2 Multicollinearity test


[24]: from statsmodels.stats.outliers_influence import variance_inflation_factor

def latex_safe_print(df):
    # Escape underscores so the table can be pasted into a LaTeX document.
    print(df.to_string().replace('_', '\\_'))

X = data[['ROA', 'Leverage', 'div', 'independecy', 'gender_composition',
          'size', 'board size']]
X = sm.add_constant(X)

vifdata = pd.DataFrame()
vifdata["Variable"] = X.columns
vifdata["VIF"] = [variance_inflation_factor(X.values, i)
                  for i in range(X.shape[1])]

latex_safe_print(vifdata)

Variable VIF
0 const 47.315123
1 ROA 1.069206
2 Leverage 1.226521
3 div 1.381452
4 independecy 1.351410
5 gender\_composition 1.151373
6 size 2.252770
7 board size 2.021347
const: The VIF value for the constant is very high (47.315123), but multicollinearity is not usually
a concern for the constant.
ROA: With a VIF of 1.069206, this variable shows no sign of significant multicollinearity.
Leverage: A VIF of 1.226521 indicates the absence of multicollinearity problems for this variable.
div: The VIF is 1.381452, which is also well below the common threshold of 5 or 10, suggesting
that there is no multicollinearity of concern.
independecy: With a VIF of 1.351410, this variable does not seem to suffer from multicollinearity.
gender composition: The VIF value of 1.151373 is low, indicating that there are no multicollinearity
issues here.
size: A VIF of 2.252770 is slightly higher than the other variables, but remains well below the
threshold of concern, indicating an absence of serious multicollinearity.
board size: The VIF of 2.021347 also suggests that this variable does not suffer from significant
multicollinearity.
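As a sanity check, each VIF can be reproduced from its definition, VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared of an auxiliary regression of variable j on all the other regressors. A minimal sketch for size, assuming the X matrix (with its constant) from the cell above is still in scope:

[ ]: # Auxiliary regression: 'size' on the remaining columns, constant included.
aux = sm.OLS(X['size'], X.drop(columns=['size'])).fit()

# VIF = 1 / (1 - R²); this should match the ~2.25 reported in the table.
print(1 / (1 - aux.rsquared))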

1.3 Heteroscedasticity test


[25]: import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

[26]: residuals = model.resid
exog = model.model.exog

bp_test = het_breuschpagan(residuals, exog)

labels = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']

bp_test_rounded = dict(zip(labels, [round(val, 2) for val in bp_test]))

print(bp_test_rounded)

{'Lagrange multiplier statistic': 241.89, 'p-value': 0.0, 'f-value': 16.2, 'f p-value': 0.0}
The Lagrange multiplier statistic and associated p-value indicate whether the null hypothesis of
homoscedasticity (constancy of error variance) is rejected or not. In this case, with an extremely
low p-value (close to 0), the null hypothesis is strongly rejected. This suggests the presence of
heteroscedasticity in the model. The F-statistic and F p-value are another way of testing for
heteroscedasticity. A very low F p-value also confirms rejection of the null hypothesis of
homoscedasticity.
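Given this result, a common remedy is to keep the OLS point estimates but swap in robust standard errors. A minimal sketch using statsmodels' cluster-robust covariance, clustering by firm (one reasonable choice for panel data; HC-type corrections are an alternative):

[ ]: # Same specification, same coefficients; only the standard errors,
# t-statistics and p-values change.
robust_model = smf.ols(formula, data=data_reset).fit(
    cov_type='cluster', cov_kwds={'groups': data_reset['company_id']})
print(robust_model.summary())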

1.4 Residual normality test


[27]: from statsmodels.stats.stattools import jarque_bera

jb_test = jarque_bera(residuals)

print(f"Jarque-Bera test statistic: {jb_test[0]}, p-value: {jb_test[1]}")

Jarque-Bera test statistic: 8650.52523664575, p-value: 0.0


The Jarque-Bera test statistic is very high, and the p-value is indicated as 0.0, meaning that it is
extremely small. Similarly, this leads to the rejection of the null hypothesis that the residuals are
normally distributed.
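A Q-Q plot makes this departure from normality visible, with the heavy tails implied by the kurtosis of 28 showing up as points bending away from the reference line. A minimal sketch, assuming residuals from above:

[ ]: import matplotlib.pyplot as plt

# Residual quantiles against a fitted normal distribution; line='s' draws a
# standardized reference line.
sm.qqplot(residuals, line='s')
plt.show()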

1.5 Test for independence of residuals


[28]: from statsmodels.stats.stattools import durbin_watson

residuals = model.resid

dw_statistic = durbin_watson(residuals)

print(f'Durbin-Watson statistic: {dw_statistic}')

Durbin-Watson statistic: 0.6374937845810602


The Durbin-Watson statistic is well below 2, indicating the presence of positive autocorrelation
in the residuals of the regression model. In simpler terms, the errors (residuals) in the model are
not independent of each other: one error is likely to be followed by an error of similar sign.
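With autocorrelation on top of the heteroscedasticity found earlier, HAC (Newey-West) standard errors are another option besides the clustered fit sketched above; the lag length here is an illustrative assumption, not a recommendation:

[ ]: # Heteroscedasticity- and autocorrelation-consistent (HAC) covariance.
hac_model = smf.ols(formula, data=data_reset).fit(
    cov_type='HAC', cov_kwds={'maxlags': 1})
print(hac_model.summary())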

2 OLS demonstration
[29]: import pandas as pd
import statsmodels.api as sm
import numpy as np
from scipy.stats import t

company_dummies = pd.get_dummies(data['company_id'], prefix='company',
                                 drop_first=True)
year_dummies = pd.get_dummies(data['year_id'], prefix='year', drop_first=True)

X_original = data[['ROA', 'Leverage', 'div', 'independecy',
                   'gender_composition', 'size', 'board size']]

company_dummies = company_dummies.astype(int)
year_dummies = year_dummies.astype(int)

X = pd.concat([X_original, company_dummies, year_dummies], axis=1)


X = np.hstack([np.ones((X.shape[0], 1)), X])
Y = data['SP']

#X'X
XTX = np.dot(X.T, X)

#(X'X)^-1
XTX_inv = np.linalg.inv(XTX)

#X'Y
XTY = np.dot(X.T, Y)

beta = np.dot(XTX_inv, XTY)

n, k = X.shape

y_pred = np.dot(X, beta)

residuals = Y - y_pred

mse = np.dot(residuals.T, residuals) / (n - k)

se = np.sqrt(np.diagonal(mse * XTX_inv))

t_values = beta / se

p_values = 2 * (1 - t.cdf(np.abs(t_values), n - k))

# R²
ss_total = np.sum((Y - np.mean(Y))**2)
ss_res = np.sum(residuals**2)
r_squared = 1 - ss_res / ss_total

# Adjusted R²
r_squared_adj = 1 - (1 - r_squared) * (n - 1) / (n - k)

results_df = pd.DataFrame({
'Coefficient': beta,
'Std. Error': se,
't-value': t_values,
'P>|t|': p_values
}, index=['const'] + list(X_original.columns) + list(company_dummies.columns)
     + list(year_dummies.columns))

additional_stats = {
'R-squared': r_squared,
'Adj. R-squared': r_squared_adj,
'No. Observations': n
}

results_df['P>|t|'] = results_df['P>|t|'].round(4)

print(results_df)
print("\nAdditional Statistics:")
for stat, value in additional_stats.items():
print(f"{stat}: {value:.3f}")

                    Coefficient  Std. Error    t-value   P>|t|
const              -2568.422803  989.656993  -2.595266  0.0100
ROA -1.613838 1.591824 -1.013829 0.3115
Leverage -0.824615 0.162247 -5.082480 0.0000
div 161.143113 92.553849 1.741074 0.0828
independecy -1.511207 1.548282 -0.976054 0.3299
gender_composition 14.732937 3.571554 4.125078 0.0000
size 315.092883 115.370743 2.731133 0.0067
board size -28.515198 24.608642 -1.158747 0.2475
company_2 676.272621 319.952268 2.113667 0.0354
company_3 1696.724665 228.658683 7.420338 0.0000
company_4 959.739138 243.508958 3.941289 0.0001
company_5 682.199094 321.635452 2.121032 0.0348
company_6 1784.429428 200.858879 8.883996 0.0000
company_7 579.236788 199.972972 2.896575 0.0041
company_8 726.824526 211.743972 3.432563 0.0007
company_9 431.461767 208.545273 2.068912 0.0395
company_10 980.883877 323.822560 3.029078 0.0027
company_11 747.902329 409.535046 1.826223 0.0689

company_12 71.713155 449.902047 0.159397 0.8735
company_13 526.512143 241.999528 2.175674 0.0304
company_14 252.824070 192.842361 1.311040 0.1909
company_15 792.275716 384.910750 2.058336 0.0405
company_16 1265.723225 583.549915 2.169006 0.0309
company_17 802.217004 420.655878 1.907062 0.0575
company_18 1032.954570 207.090541 4.987937 0.0000
company_19 501.161788 278.327716 1.800618 0.0728
company_20 895.425312 218.496163 4.098128 0.0001
company_21 559.201149 186.371414 3.000466 0.0029
company_22 539.372896 192.304671 2.804783 0.0054
company_23 581.489429 232.679506 2.499100 0.0130
company_24 1280.735002 308.706780 4.148710 0.0000
company_25 217.545348 200.044300 1.087486 0.2778
company_26 96.408979 185.117340 0.520799 0.6029
company_27 392.175069 184.589913 2.124575 0.0345
company_28 3252.631383 231.511421 14.049550 0.0000
company_29 1101.543797 306.116541 3.598446 0.0004
company_30 177.443145 171.401973 1.035246 0.3014
company_31 857.045569 379.329949 2.259367 0.0246
company_32 -31.448507 320.475929 -0.098131 0.9219
company_33 517.079466 272.877189 1.894916 0.0591
year_2014 -35.330639 76.690696 -0.460690 0.6454
year_2015 4.801188 76.506748 0.062755 0.9500
year_2016 42.221361 76.461543 0.552191 0.5813
year_2017 129.324965 76.689925 1.686336 0.0928
year_2018 151.953732 78.648230 1.932068 0.0544
year_2019 159.220694 78.127940 2.037948 0.0425
year_2020 181.664185 80.724961 2.250409 0.0252
year_2021 192.723196 82.584161 2.333658 0.0203
year_2022 180.622451 85.512668 2.112230 0.0356

Additional Statistics:
R-squared: 0.836
Adj. R-squared: 0.808
No. Observations: 329.000
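As a closing check, the hand-rolled estimates can be compared against the statsmodels fit from section 1. The two designs name and order the dummy columns differently, so the sketch below compares shared quantities rather than the raw arrays (it assumes model, r_squared and results_df are all still in scope):

[ ]: # R² and any individual coefficient should agree to numerical precision.
print(np.isclose(r_squared, model.rsquared))
print(np.isclose(results_df.loc['size', 'Coefficient'], model.params['size']))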
