Here we explore the theoretical coefficient distributions of a linear regression model. When fitting a regression model we get estimates of the standard deviation of each coefficient. We use bootstrapping to build an empirical distribution of the regression coefficients and compare it against the theoretical one.

The outline is as follows:

  • Collect some data
  • Fit a regression model
  • Fit many models on bootstrapped samples of the training data
  • Compare the distribution of bootstrapped coefficients to the theoretical distribution of coefficients

The data used is weather data collected from Kaggle: https://www.kaggle.com/budincsevity/szeged-weather

import os
import pandas as pd
import numpy as np

dir_path = os.getcwd()
data = pd.read_csv(os.path.join(dir_path, "data", "weatherHistory.csv"))
data.head()
Formatted Date Summary Precip Type Temperature (C) Apparent Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars) Daily Summary
0 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222 7.388889 0.89 14.1197 251.0 15.8263 0.0 1015.13 Partly cloudy throughout the day.
1 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556 7.227778 0.86 14.2646 259.0 15.8263 0.0 1015.63 Partly cloudy throughout the day.
2 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778 9.377778 0.89 3.9284 204.0 14.9569 0.0 1015.94 Partly cloudy throughout the day.
3 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889 5.944444 0.83 14.1036 269.0 15.8263 0.0 1016.41 Partly cloudy throughout the day.
4 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556 6.977778 0.83 11.0446 259.0 15.8263 0.0 1016.51 Partly cloudy throughout the day.

Next we do some data preparation: we build the feature matrix for predicting apparent temperature from most of the remaining columns:

# Select the feature columns used to predict apparent temperature
X = data.loc[:, ['Temperature (C)',
       'Humidity', 'Wind Speed (km/h)',
       'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover',
       'Pressure (millibars)']]
X = X.join(pd.get_dummies(data['Precip Type']))

# add constant column
X['Constant'] = 1

y = data.loc[:, 'Apparent Temperature (C)']

X.head()
Temperature (C) Humidity Wind Speed (km/h) Wind Bearing (degrees) Visibility (km) Loud Cover Pressure (millibars) rain snow Constant
0 9.472222 0.89 14.1197 251.0 15.8263 0.0 1015.13 1 0 1
1 9.355556 0.86 14.2646 259.0 15.8263 0.0 1015.63 1 0 1
2 9.377778 0.89 3.9284 204.0 14.9569 0.0 1015.94 1 0 1
3 8.288889 0.83 14.1036 269.0 15.8263 0.0 1016.41 1 0 1
4 8.755556 0.83 11.0446 259.0 15.8263 0.0 1016.51 1 0 1

We reduce the training data size to widen the coefficient distributions for the purposes of this article.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9)
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_train.shape
(9645, 10)

Statsmodels approach

With statsmodels we can apply the ordinary least squares solution to the above data to recover estimates of the model coefficients. When we fit a linear regression model, the Hessian of the squared-error loss (the matrix of second-order derivatives, X'X) captures how sensitive the coefficients are to changes in the data. Its inverse, scaled by the residual variance, gives the covariance matrix of the coefficients, and the square roots of its diagonal give their standard deviations.
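
Before calling statsmodels, a minimal sketch of that calculation may help (this block is not from the original analysis; it reuses the X_train and y_train built above, and uses a pseudo-inverse because the design matrix here turns out to be rank deficient):

# Theoretical coefficient covariance: Cov(beta) = sigma^2 * (X'X)^-1,
# where X'X is (up to a constant) the Hessian of the squared-error loss
Xm = X_train.values.astype(float)
yv = y_train.values

XtX_inv = np.linalg.pinv(Xm.T @ Xm)        # pinv handles the singular X'X
beta = XtX_inv @ Xm.T @ yv                 # OLS coefficient estimates
resid = yv - Xm @ beta
dof = Xm.shape[0] - np.linalg.matrix_rank(Xm)
sigma2 = resid @ resid / dof               # unbiased residual variance
coef_std = np.sqrt(sigma2 * np.diag(XtX_inv))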

import statsmodels.api as sm

results = sm.OLS(y_train, X_train).fit()
print(results.summary())
# extract the theoretical coefficient means and standard errors
w_sm_mu = results.params
w_sm_std = results.bse  # sqrt(diag(residual variance * (X'X)^-1))
OLS Regression Results                               
====================================================================================
Dep. Variable:     Apparent Temperature (C)   R-squared:                       0.990
Model:                                  OLS   Adj. R-squared:                  0.990
Method:                       Least Squares   F-statistic:                 1.175e+05
Date:                      Thu, 11 Jun 2020   Prob (F-statistic):               0.00
Time:                              00:00:29   Log-Likelihood:                -14401.
No. Observations:                      9645   AIC:                         2.882e+04
Df Residuals:                          9636   BIC:                         2.888e+04
Df Model:                                 8                                         
Covariance Type:                  nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Temperature (C)            1.1138      0.002    607.061      0.000       1.110       1.117
Humidity                   0.9269      0.078     11.828      0.000       0.773       1.081
Wind Speed (km/h)         -0.0919      0.002    -54.661      0.000      -0.095      -0.089
Wind Bearing (degrees)     0.0006      0.000      5.584      0.000       0.000       0.001
Visibility (km)           -0.0082      0.003     -2.805      0.005      -0.014      -0.002
Loud Cover              2.722e-15   1.54e-15      1.764      0.078   -3.03e-16    5.75e-15
Pressure (millibars)       0.0002   9.24e-05      2.550      0.011    5.45e-05       0.000
rain                      -0.3603      0.186     -1.942      0.052      -0.724       0.003
snow                      -0.8371      0.189     -4.432      0.000      -1.207      -0.467
Constant                  -1.9649      0.224     -8.783      0.000      -2.403      -1.526
==============================================================================
Omnibus:                      240.963   Durbin-Watson:                   1.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              262.505
Skew:                           0.378   Prob(JB):                     9.95e-58
Kurtosis:                       3.287   Cond. No.                     3.57e+20
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.96e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The above model coefficients suggest that the apparent temperature is mostly driven by the actual temperature, with humidity contributing positively and wind speed negatively. The singular-matrix warning also deserves a note: the 'Loud Cover' column appears to take a single value throughout, which would make the design matrix rank deficient and explains its near-zero (~1e-15) coefficient.
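
A quick diagnostic sketch (not in the original write-up) to back this up:

# the design matrix has 10 columns; a rank below 10 confirms redundancy
print(np.linalg.matrix_rank(X_train.values.astype(float)))
# a single unique value in 'Loud Cover' would be the culprit
print(data['Loud Cover'].nunique())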

Bootstrapping approach

To get an empirical view of the distribution of the regression coefficients we can refit the model on many bootstrapped samples. A bootstrap sample is drawn by sampling rows from the training set uniformly at random, with replacement, until it matches the original size. If we take 1000 bootstrap samples and fit a model on each, we gain 1000 estimates of the regression coefficients.

We do this as follows:

w_bs = []
n = X_train.shape[0]
for i in range(1000):
    # draw n row indices uniformly at random, with replacement
    samp = np.random.randint(n, size=n)
    # refit the model on the bootstrap sample and store its coefficients
    results_bs = sm.OLS(y_train.loc[samp], X_train.loc[samp, :]).fit()
    w_bs.append(results_bs.params)
w_bs = np.array(w_bs)

# summarise coefficient distributions
w_bs_mu = np.mean(w_bs, axis=0)
w_bs_std = np.std(w_bs, axis=0)
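
As an aside (a sketch not in the original comparison), the same draws yield percentile confidence intervals, which can be set against the [0.025, 0.975] columns of the statsmodels summary:

# 95% percentile intervals from the bootstrap draws
ci_bs = np.percentile(w_bs, [2.5, 97.5], axis=0).T
ci = results.conf_int()
ci.columns = ['sm_lower', 'sm_upper']
ci['bs_lower'], ci['bs_upper'] = ci_bs[:, 0], ci_bs[:, 1]
print(ci)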

Results comparison

We can now compare the coefficient distributions to the theoretical results from statsmodels.

Below we plot the coefficient means and standard deviations from both statsmodels and the bootstrapped models.

import matplotlib.pyplot as plt
plt.style.use("seaborn-whitegrid")

fig, ax = plt.subplots(ncols=2, figsize=(10,6))
ax[0].plot(range(X.shape[1]), w_sm_mu, label='statsmodels')
ax[0].plot(range(X.shape[1]), w_bs_mu, 'x', label='bootstrapped')
ax[0].set_ylabel('Mean')

ax[1].plot(range(X.shape[1]), w_sm_std, label='statsmodels')
ax[1].plot(range(X.shape[1]), w_bs_std, 'x', label='bootstrapped')
ax[1].set_ylabel('Standard deviation')

plt.legend()
plt.show()

Figure: coefficient means (left) and standard deviations (right) from statsmodels and the bootstrapped models.

The above figures show that both the means and standard deviations are close, which suggests the theoretical coefficient distributions are likely correct. The bootstrap estimates should converge further as we take more bootstrap samples.

I would expect the bootstrapped coefficients to somewhat overestimate the coefficient standard deviations: each bootstrap sample contains fewer unique data entries to fit the model, so the coefficients may not be as tightly determined. We don't seem to have enough evidence here to confirm this.
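
One rough way to probe it is the ratio of bootstrapped to theoretical standard deviations (a small sketch over the quantities computed above; values above 1 would point at wider bootstrap spreads):

# ratio of bootstrapped to theoretical coefficient standard deviations
ratio = pd.Series(np.asarray(w_bs_std) / np.asarray(w_sm_std),
                  index=X_train.columns)
print(ratio.round(2))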

The raw result numbers are shown below:

coefficients = pd.concat([w_sm_mu,
                        pd.DataFrame(data=w_bs_mu, index=X_train.columns),
                        pd.DataFrame(data=w_sm_std, index=X_train.columns),
                        pd.DataFrame(data=w_bs_std, index=X_train.columns)], axis=1)

coefficients.columns = ['statsmodels_mu', 'bootstrapped_mu', 'statsmodels_std', 'bootstrapped_std']
coefficients
statsmodels_mu bootstrapped_mu statsmodels_std bootstrapped_std
Temperature (C) 1.113766e+00 1.113721e+00 1.702782e-03 1.919617e-03
Humidity 9.269036e-01 9.255212e-01 7.272990e-02 8.587625e-02
Wind Speed (km/h) -9.193476e-02 -9.207228e-02 1.560984e-03 2.599598e-03
Wind Bearing (degrees) 5.787597e-04 5.797346e-04 9.620203e-05 1.074969e-04
Visibility (km) -8.162228e-03 -8.096304e-03 2.701086e-03 2.871113e-03
Loud Cover 2.722203e-15 2.385971e-17 1.432384e-15 2.841603e-15
Pressure (millibars) 2.355312e-04 2.313375e-04 8.573335e-05 9.545615e-05
rain -3.603448e-01 -3.645398e-01 1.721736e-01 1.573641e-01
snow -8.370791e-01 -8.412777e-01 1.753044e-01 1.671947e-01
Constant -1.964852e+00 -1.954745e+00 2.076308e-01 2.028931e-01