In this post I reformulate linear regression in a Bayesian framework. This gives us a notion of epistemic uncertainty, which allows us to generate probabilistic model predictions. I formulate a model class that performs linear regression via Bayes' rule updates, show that the results match those from the statsmodels library, and demonstrate some of the benefits of the sequential Bayesian approach.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import time
from scipy.stats import norm
import matplotlib.pyplot as plt

plt.style.use("seaborn-whitegrid")


## Generate some dummy data

We generate some ideal data for a linear regression model so that we know the true coefficients for this example.

n = 10000
m = 2
np.random.seed(3)

noise_std = 2

x = np.random.uniform(-1, 1, size=(n, m))
x = np.hstack([np.ones(shape=(n, 1)), x])
w = np.random.uniform(-1, 1, size=(m + 1, 1))
y_true = x @ w
y = y_true + np.random.normal(loc=0, scale=noise_std, size=(n, 1))

print("True coefficients:\n", w)

True coefficients:
[[-0.24831899]
[ 0.99863554]
[-0.88431078]]


## Fit a linear regression with statsmodels

The implementation of linear regression in statsmodels returns the standard error of the fitted coefficients.

results = sm.OLS(y, x).fit()
print(results.summary())

# extract coefficient distributions
w_sm_mu = results.params
w_sm_std = np.sqrt(np.diag(results.normalized_cov_params))

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.138
Method:                 Least Squares   F-statistic:                     797.0
Date:                Mon, 07 Jun 2021   Prob (F-statistic):          6.67e-322
Time:                        23:10:31   Log-Likelihood:                -21011.
No. Observations:               10000   AIC:                         4.203e+04
Df Residuals:                    9997   BIC:                         4.205e+04
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2520      0.020    -12.736      0.000      -0.291      -0.213
x1             0.9965      0.034     29.195      0.000       0.930       1.063
x2            -0.9231      0.034    -26.849      0.000      -0.990      -0.856
==============================================================================
Omnibus:                        1.539   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.463   Jarque-Bera (JB):                1.554
Skew:                          -0.016   Prob(JB):                        0.460
Kurtosis:                       2.948   Cond. No.                         1.74
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


If we received extra data, we would have no choice but to retrain the model from scratch. In the following, we train the model on increasing fractions of the data and time each fit.

idx = np.floor(np.linspace(0, x.shape[0], num=50))[1:]
time_iter = []

params_mu = []
params_std = []
for end_idx in idx:
    t0 = time.process_time()
    results = sm.OLS(y[: int(end_idx)], x[: int(end_idx)]).fit()
    time_iter.append(time.process_time() - t0)

    params_mu.append(results.params)
    params_std.append(np.sqrt(np.diag(results.normalized_cov_params)))

params_mu = pd.DataFrame(params_mu, index=idx)
params_std = pd.DataFrame(params_std, index=idx, columns=params_mu.columns)


As expected, training the model on increasing lengths of data causes an increase in fitting time.

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(idx, time_iter, label="statsmodels")
ax.set_ylabel("Time taken (s)")
ax.set_xlabel("No. training rows")
plt.legend()
plt.show()


We can also show that, with increasing data, our coefficient estimates become more confident as the standard error shrinks. This is a measure of our epistemic (model) uncertainty.

fig, ax = plt.subplots(ncols=2, figsize=(10, 6))
params_mu.plot(ax=ax[0])
params_std.plot(ax=ax[1])
ax[0].set_xlabel("No. training rows")
ax[0].set_ylabel("Coefficient means")
ax[1].set_ylabel("Coefficient standard dev.")
plt.show()
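As a sanity check on this shrinkage, here is a minimal closed-form sketch (plain numpy rather than statsmodels, using the standard OLS formula se = sqrt(sigma^2 * diag((X^T X)^-1)); all names here are illustrative): quadrupling the amount of data roughly halves the standard errors, consistent with the familiar 1/sqrt(n) rate.

```python
import numpy as np

rng = np.random.default_rng(0)


def ols_std_err(x, y):
    # closed-form OLS: beta_hat = (X^T X)^{-1} X^T y
    xtx_inv = np.linalg.inv(x.T @ x)
    beta = xtx_inv @ x.T @ y
    resid = y - x @ beta
    # unbiased estimate of the noise variance
    sigma2 = (resid.T @ resid) / (x.shape[0] - x.shape[1])
    return np.sqrt(sigma2 * np.diag(xtx_inv))


n = 40000
x = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, 2))])
y = x @ np.array([0.5, 1.0, -1.0]) + rng.normal(0, 2, size=n)

se_small = ols_std_err(x[: n // 4], y[: n // 4])  # quarter of the data
se_full = ols_std_err(x, y)                       # all of the data

print(se_full / se_small)  # each ratio is roughly 0.5
```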


## Sequential Bayesian regression

Another way of approaching this problem is sequential Bayesian regression. This method follows Bayes' theorem: we start with a prior distribution over the regression coefficients, update it with the data we observe to get a posterior distribution, then treat that posterior as the new prior and repeat the process.
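As a minimal sketch of this prior-to-posterior chaining, consider the simpler problem of estimating the mean of a Gaussian with known noise precision (a hypothetical scalar example, not the regression model itself): folding in one observation at a time gives the same posterior as a single batch update.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=500)

beta = 1.0            # known noise precision
mu, prec = 0.0, 1e-6  # prior mean and precision (nearly flat prior)

# sequential: update the posterior one observation at a time
for obs in data:
    prec_new = prec + beta
    mu = (prec * mu + beta * obs) / prec_new
    prec = prec_new

# batch: fold in all observations at once
prec_batch = 1e-6 + beta * len(data)
mu_batch = (1e-6 * 0.0 + beta * data.sum()) / prec_batch

print(mu, mu_batch)  # identical up to floating point
```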

With Gaussian noise, linear regression yields a multivariate normal distribution over the coefficient estimates, and the conjugate prior is also a multivariate normal distribution. We can therefore write an analytical expression for the Bayes' rule posterior update.

The update rules I used were taken from: https://cedar.buffalo.edu/~srihari/CSE574/Chap3/3.4-BayesianRegression.pdf
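In the notation of those notes, with a Gaussian prior $\mathcal{N}(m_0, S_0)$ over the weights and noise precision $\beta$, observing data $(X, y)$ gives the posterior $\mathcal{N}(m_N, S_N)$ with:

```latex
\begin{aligned}
S_N^{-1} &= S_0^{-1} + \beta X^\top X \\
m_N &= S_N \left( S_0^{-1} m_0 + \beta X^\top y \right)
\end{aligned}
```

These are exactly the quantities computed in the `update` method below.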

### Model definition

We can build this as a class with a similar API to sklearn models:

class BayesLinearRegressor:
    def __init__(
        self, number_of_features, mean=None, cov=None, alpha=1e6, beta=1
    ):
        # prior distribution on the weights
        if mean is None:
            mean = np.zeros((number_of_features, 1))
        self.mean = mean

        if cov is None:
            cov = alpha * np.identity(number_of_features)
        self.cov = cov
        self.cov_inv = np.linalg.inv(self.cov)

        self.beta = beta  # noise precision

    def fit(self, x, y):
        return self.update(x, y)

    def update(self, x, y):
        """
        Perform a Bayesian update step
        """
        if x.ndim == 1:
            x = x[np.newaxis, :]
        if y.ndim == 1:
            y = y[:, np.newaxis]

        # update state of covariance and mean
        cov_n_inv = self.cov_inv + self.beta * x.T @ x
        cov_n = np.linalg.inv(cov_n_inv)
        mean_n = cov_n @ (self.cov_inv @ self.mean + self.beta * x.T @ y)

        self.cov_inv = cov_n_inv
        self.cov = cov_n
        self.mean = mean_n
        return self

    def predict(self, x):
        mean = x @ self.mean
        # per-point epistemic std: the diagonal of x @ cov @ x.T
        scale = np.sqrt(np.sum((x @ self.cov) * x, axis=1))
        return mean, scale

    @property
    def coef_(self):
        return self.mean

    @property
    def scale_(self):
        return np.sqrt(np.diag(self.cov))



We can train the model as follows. We use numpy testing to confirm that the coefficients we get are equal to those of statsmodels.

bayes_linear_regression = BayesLinearRegressor(x.shape[1])
bayes_linear_regression.fit(x, y)
np.testing.assert_array_almost_equal(
    bayes_linear_regression.coef_, params_mu.tail(1).transpose().to_numpy()
)
np.testing.assert_array_almost_equal(
    bayes_linear_regression.scale_, params_std.tail(1).to_numpy().flatten()
)



### Sequential training

We can now train the model with chunks of data, timing each fit and tracking the coefficients as data accumulates.

bayes_linear_regression = BayesLinearRegressor(x.shape[1])

time_iter_seq = []
params_mu_seq = []
params_std_seq = []
for i, end_idx in enumerate(idx):
    t0 = time.process_time()
    if i > 0:
        start_idx = int(idx[i - 1])
    else:
        start_idx = 0
    bayes_linear_regression.update(
        x[start_idx : int(end_idx)],
        y[start_idx : int(end_idx)],
    )
    time_iter_seq.append(time.process_time() - t0)

    params_mu_seq.append(bayes_linear_regression.coef_.flatten())
    params_std_seq.append(bayes_linear_regression.scale_)

params_mu_seq = pd.DataFrame(
    params_mu_seq, index=idx, columns=params_mu.columns
)
params_std_seq = pd.DataFrame(
    params_std_seq, index=idx, columns=params_mu.columns
)



The time taken to train on each chunk is approximately constant. We can therefore fit an initial model and keep updating it with new data, as an alternative to refitting a single large model.

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(idx, time_iter, label="statsmodels")
ax.plot(idx, time_iter_seq, label="sequential")
# ax.plot(idx, np.cumsum(time_iter_seq), label="cumulative_sequential")
ax.set_ylabel("Time taken (s)")
ax.set_xlabel("No. training rows")
plt.legend()
plt.show()


The model coefficients behave as before. For example, we can similarly show that the standard deviation of the coefficients shrinks as the data length grows.

fig, ax = plt.subplots(ncols=2, figsize=(10, 6))
params_mu_seq.plot(ax=ax[0])
params_std_seq.plot(ax=ax[1])
ax[0].set_xlabel("No. training rows")
ax[0].set_ylabel("Coefficient means")
ax[1].set_ylabel("Coefficient standard dev.")
plt.show()


### Posteriors vs priors

One of the benefits of Bayesian linear regression is the ability to place prior distributions on the model coefficients. To demonstrate this, we use a prior with a much smaller variance, so it is no longer uninformative. We fit the model and plot the pdfs of the prior and posterior. The posteriors evidently converge to the true coefficients and have a tight distribution.
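A useful way to understand the effect of this prior is the ridge connection: with a zero-mean prior of covariance alpha * I, the posterior mean coincides with ridge regression with penalty lambda = 1 / (alpha * beta). A small self-contained check of this identity (all variable names here are illustrative, not from the model class above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
x = rng.uniform(-1, 1, size=(n, d))
y = x @ np.array([0.5, 1.0, -1.0]) + rng.normal(0, 2, size=n)

alpha, beta = 0.5, 1.0  # prior variance scale and noise precision

# Bayesian posterior mean with zero-mean prior N(0, alpha * I)
s_n_inv = np.eye(d) / alpha + beta * x.T @ x
m_n = np.linalg.solve(s_n_inv, beta * x.T @ y)

# ridge regression with lambda = 1 / (alpha * beta)
lam = 1.0 / (alpha * beta)
w_ridge = np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

print(np.allclose(m_n, w_ridge))  # True
```

So a tighter prior (smaller alpha) corresponds to a stronger ridge penalty, shrinking the coefficients towards zero.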

bayes_linear_regression = BayesLinearRegressor(x.shape[1], alpha=0.5)
prior_mu = bayes_linear_regression.coef_
prior_std = bayes_linear_regression.scale_
bayes_linear_regression.fit(x, y)

def norm_max(x):
    return x / x.max()


x_range = np.linspace(-3, 3, num=1000)
fig, ax = plt.subplots(nrows=m + 1, figsize=(10, 6))
for i in range(m + 1):
    ax[i].plot(
        x_range,
        norm_max(norm.pdf(x_range, loc=prior_mu[i], scale=prior_std[i])),
    )
    ax[i].plot(
        x_range,
        norm_max(
            norm.pdf(
                x_range,
                loc=params_mu_seq.iloc[-1, i],
                scale=params_std_seq.iloc[-1, i],
            )
        ),
    )
    ax[i].set_ylabel(f"$P(z_{i})$")



### Prediction uncertainty

The distribution of our coefficients gives us a distribution for our model predictions as well.

The predict method of the BayesLinearRegressor class returns a standard deviation for each point, so we can plot a few points with this confidence shaded. Note that this only represents epistemic uncertainty - i.e. uncertainty in the model coefficients, not noise from the data-generating process.
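If we also wanted the aleatoric part, a common extension (not implemented in the class above) adds the noise variance 1/beta to the epistemic term, giving a total predictive variance of 1/beta + x^T S x. A self-contained sketch under the same data-generating assumptions as this post:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 3
x = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, d - 1))])
noise_std = 2.0
beta = 1.0 / noise_std**2  # noise precision
y = x @ np.array([0.5, 1.0, -1.0]) + rng.normal(0, noise_std, size=n)

# posterior covariance of the weights under a nearly flat prior
alpha = 1e6
s_n = np.linalg.inv(np.eye(d) / alpha + beta * x.T @ x)

x_new = x[:5]
epistemic_var = np.sum((x_new @ s_n) * x_new, axis=1)  # diag of x S x^T
total_std = np.sqrt(1.0 / beta + epistemic_var)

print(np.sqrt(epistemic_var))  # small, and shrinks further with more data
print(total_std)               # floored at noise_std
```

With this much data the epistemic term is tiny, so the total predictive standard deviation is dominated by the irreducible noise.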

bayes_linear_regression = BayesLinearRegressor(x.shape[1])
bayes_linear_regression.fit(x, y)
pred_mu, pred_scale = bayes_linear_regression.predict(x[:100])

sort_id = np.argsort(pred_mu.flatten())

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(pred_mu[sort_id].flatten(), ".")

ax.fill_between(
    np.arange(100),
    pred_mu[sort_id].flatten() - pred_scale[sort_id],
    pred_mu[sort_id].flatten() + pred_scale[sort_id],
    alpha=0.3,
)
