# Multiple Regression

In this notebook we implement multiple regression in Python, first for a linear model with two independent input variables and then for a non-linear regression surface.

### Preparation

In order to implement the regression, we first import the required Python modules.

import math
import numpy as np                           # used for numerical computations
import matplotlib.pyplot as plt
import ipyvolume as ipv
from mpl_toolkits.mplot3d import Axes3D      # plotting library
from scipy.optimize import curve_fit         # implementation of the regression
from scipy.stats import pearsonr             # used for significance tests

plt.rcParams['figure.figsize'] = [10, 10]  # This is to set the size of the plots


## Part 1: Multiple Regression (linear case)

The goal of a multiple linear regression is to estimate the parameters of the best-fitting (hyper)plane to some given data (observations). Our first example is the case of 2 independent input variables $x$ and $y$ in the regression model, which describes a plane in 3D space:

$$z = \beta_0 + \beta_1 x + \beta_2 y$$

This is formulated in the next cell.

def f(data, b0, b1, b2):
    x = data[:, 0]
    y = data[:, 1]
    return b0 + b1*x + b2*y


### Creating a Toy Dataset

In the next step, we create a toy dataset from a known model, once without and once with added noise drawn from a known distribution. The noisy version lets us examine how well the estimation performs on imperfect observations.

np.random.seed(0)                       # We use a fixed random seed

beta = -5.853, 1.234, 3.567             # Ground truth regression parameters (vector)

x = np.random.random((100, 1))*10       # Input / indicator samples (as column vectors)
y = np.random.random((100, 1))*10
data = np.hstack((x, y))

z = f(data, *beta).reshape(100, 1)      # Corresponding output variable (true plane)

z_noise = z + np.random.randn(100, 1)*5 # Corresponding output variable with added noise


### Estimation of regression coefficients

Here, we show the process of coefficient estimation using a least squares adjustment. First, we describe the relation between the indicator variables and the outputs as a matrix multiplication:

$$B = A\,\beta \quad \text{with} \quad A = \begin{pmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix},\quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix},\quad B = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}$$

Then, the estimated regression coefficients are computed via the normal equations:

$$\hat{\beta} = (A^T A)^{-1} A^T B$$

In the next cell, we implement this equation and compute the estimated parameters for both the noisy data and the data without noise.

A = np.hstack((np.ones_like(x), x, y))

B = z
beta_reg = np.linalg.inv(A.T@A)@A.T@B

B = z_noise
beta_noise = np.linalg.inv(A.T@A)@A.T@B

print('Coefficients:')
print('True:              ', beta)
print('Estimated w/o noise', beta_reg[:,0])
print('Estimated w/ noise ', beta_noise[:,0])

Coefficients:
True:               (-5.853, 1.234, 3.567)
Estimated w/o noise [-5.853  1.234  3.567]
Estimated w/ noise  [-4.31647347  0.95817723  3.43757662]


Comparing the ground truth with the estimated regression coefficients, we see that the estimates based on the noisy samples are quite close to the true values, while the estimates from the noise-free data reproduce the ground truth exactly.
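As a sanity check, the same least-squares solution can also be obtained with `np.linalg.lstsq`, which avoids the explicit inversion of $A^TA$ and is numerically more stable. A minimal sketch that reproduces the noisy toy data from above:

```python
import numpy as np

np.random.seed(0)                                 # same fixed seed as above
beta = np.array([-5.853, 1.234, 3.567]).reshape(3, 1)

x = np.random.random((100, 1)) * 10               # same toy data as above
y = np.random.random((100, 1)) * 10
A = np.hstack((np.ones_like(x), x, y))            # design matrix [1, x, y]
z_noise = A @ beta + np.random.randn(100, 1) * 5  # noisy observations

# Normal equations, as in the cell above ...
beta_normal = np.linalg.inv(A.T @ A) @ A.T @ z_noise

# ... versus np.linalg.lstsq, which solves the same least squares
# problem without forming the explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(A, z_noise, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))       # both give the same solution
```

For the small, well-conditioned design matrix used here both approaches agree; for larger or nearly collinear designs the `lstsq` variant is the safer choice.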

### Visualization

For visualization, we draw the true and estimated plane. Since there is no difference between the estimated parameters without noise and the correct plane, we only draw the one estimated based on noisy data.

def draw_plane(ax, xlim, ylim, beta, color, label):
    X, Y = np.meshgrid(np.arange(*xlim), np.arange(*ylim))
    data = np.hstack((X.reshape(-1, 1), Y.reshape(-1, 1)))
    Z = f(data, *beta).reshape(X.shape)
    ax.plot_wireframe(X, Y, Z, color=color, label=label)

plt.figure()
ax = plt.axes(projection='3d')   # plt.gca(projection=...) is deprecated in newer Matplotlib
ax.scatter(x,y,z_noise,color='b',label='Data points')
draw_plane(ax,[0.,11.], [0.,11.], beta, 'k', 'True plane')
draw_plane(ax,[0.,11.], [0.,11.], beta_noise, 'r', 'Regressed plane')

ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z')
ax.legend()

plt.show()


At this point we can also use ipyvolume to create an interactive visualization.

def draw_plane_ipv(xlim, ylim, beta, color):
    X, Y = np.meshgrid(np.arange(*xlim), np.arange(*ylim))
    data = np.hstack((X.reshape(-1, 1), Y.reshape(-1, 1)))
    Z = f(data, *beta).reshape(X.shape)
    ipv.plot_wireframe(X, Y, Z, color=color)

ipv.figure()
ipv.scatter(x.ravel(), y.ravel(), z_noise.ravel(), color="blue")
draw_plane_ipv([0.,11.], [0.,11.], beta, 'black')
draw_plane_ipv([0.,11.], [0.,11.], beta_noise, 'red')
ipv.show()


### Correlation analysis

The squared multiple correlation coefficient is calculated via

$$R^2 = c^T R_{xx}^{-1} c$$

where the vector $c = (r_{xz}, r_{yz})^T$ contains the correlations between the independent variables $x$, $y$ and the dependent variable $z$, and $R_{xx}^{-1}$ is the inverse of the correlation matrix of the independent variables $x$ and $y$.

correlations = np.corrcoef(np.hstack((x,y,z)),rowvar=False)
c = correlations[:2, 2:3]
Rxx = correlations[:2,:2]
R2 = c.T @ np.linalg.inv(Rxx) @ c
print('R2 w/o noise', R2[0,0])

R2 w/o noise 1.0000000000000002


For the dataset without noise we obtain an $R^2$ of approximately 1, which means that the dependent variable $z$ can be predicted almost perfectly from the independent variables $x$ and $y$ by a linear function (hardly surprising, since the dependency is in fact linear).

correlations = np.corrcoef(np.hstack((x,y,z_noise)),rowvar=False)
c = correlations[:2, 2:3]
Rxx = correlations[:2,:2]
R2 = c.T @ np.linalg.inv(Rxx) @ c
print('R2 w noise',R2[0,0])

R2 w noise 0.819882735182887


The dataset with noise yields an $R^2$ of about 0.82, which is still fairly close to 1. The lower value indicates that the plane fits the data points well, but no longer perfectly.
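To judge whether the individual correlations are statistically significant, we can use `pearsonr` (imported at the top of the notebook), which returns the correlation coefficient together with a two-sided p-value for the null hypothesis of zero correlation. A sketch on the noisy toy data from Part 1:

```python
import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)                                  # same toy data as in Part 1
x = np.random.random((100, 1)) * 10
y = np.random.random((100, 1)) * 10
z_noise = (-5.853 + 1.234 * x + 3.567 * y) + np.random.randn(100, 1) * 5

# Correlation of each independent variable with the noisy output,
# together with the p-value of the test against zero correlation
for name, v in [('x', x), ('y', y)]:
    r, p = pearsonr(v.ravel(), z_noise.ravel())
    print(f'r_{name}z = {r:.3f}, p = {p:.2e}')
```

A p-value below a chosen significance level (e.g. 0.05) indicates that the corresponding correlation is statistically significant; with 100 samples this holds for both $x$ and $y$ here.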

## Part 2: Multiple Regression for estimating a non-linear surface

In this example the target variable $z$ depends non-linearly on the two independent input variables $x$ and $y$:

$$z = \beta_0 + \beta_1 x^2 + \beta_2 x y$$

Our goal is again to find the coefficients that minimize the squared errors between observations and predictions. The underlying function is implemented next:

def f(data, b0, b1, b2):
    x = data[:, 0]
    y = data[:, 1]
    return b0 + b1*x**2 + b2*x*y


### Creating a Toy Dataset

In the next step we will generate a dataset with known coefficients and add some white noise to it.

np.random.seed(0)                   # set fixed random seed

beta =  1.834, 0.256, 0.345         # ground truth coefficient vector
b0, b1, b2 = beta

x = np.random.random((50, 1))*10
y = np.random.random((50, 1))*10
data = np.hstack((x, y))

z = f(data, *beta)                  # output variables..
z = z + np.random.randn(50)*5       # .. with added noise


### Estimation of regression coefficients

This time we use the SciPy function `curve_fit` for the least squares optimization. It returns the optimal parameters as well as the covariance matrix of the estimated parameters.

beta_reg, covar = curve_fit(f, data, z)

print('Coefficients:')
print('True:        ', beta)
print('Estimated:   ', beta_reg)
print('Covariance:')
print(covar)

Coefficients:
True:         (1.834, 0.256, 0.345)
Estimated:    [4.00078368 0.24977405 0.35208127]
Covariance:
[[ 1.55615396e+00 -2.03205499e-02 -1.28086925e-02]
[-2.03205499e-02  9.43694227e-04 -6.58669900e-04]
[-1.28086925e-02 -6.58669900e-04  1.73090509e-03]]


Comparing the ground truth and the estimated regression coefficients, we see that the slope parameters $b_1$ and $b_2$ match the true values quite well, while the offset $b_0$ deviates more strongly; its comparatively large variance in the covariance matrix reflects this uncertainty.
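The square roots of the diagonal entries of the returned covariance matrix are one-standard-deviation uncertainties of the estimated parameters, which makes the deviations from the true values easier to interpret. A minimal sketch reproducing the fit from above:

```python
import numpy as np
from scipy.optimize import curve_fit

def f(data, b0, b1, b2):
    x = data[:, 0]
    y = data[:, 1]
    return b0 + b1 * x**2 + b2 * x * y

np.random.seed(0)                         # same toy data as above
x = np.random.random((50, 1)) * 10
y = np.random.random((50, 1)) * 10
data = np.hstack((x, y))
z = f(data, 1.834, 0.256, 0.345) + np.random.randn(50) * 5

beta_reg, covar = curve_fit(f, data, z)
sigma = np.sqrt(np.diag(covar))           # 1-sigma uncertainty per parameter

for b_true, b, s in zip((1.834, 0.256, 0.345), beta_reg, sigma):
    print(f'true {b_true:7.3f}   estimated {b:7.3f} +/- {s:.3f}')
```

Each estimate lies within a few standard deviations of its true value, so even the larger deviation of $b_0$ is consistent with the added noise.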

### Visualization

For visualization, we can draw the real and estimated surfaces.

plt.figure()
ax = plt.axes(projection='3d')   # plt.gca(projection=...) is deprecated in newer Matplotlib
ax.scatter(x, y, z, color='b',label='Data points')
draw_plane(ax,[0.,11.], [0.,11.], beta, 'k', 'True surface')
draw_plane(ax,[0.,11.], [0.,11.], beta_reg, 'r', 'Regressed surface')

ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z')
ax.legend()

plt.show()


And again using ipyvolume.

ipv.figure()
ipv.scatter(x.ravel(), y.ravel(), z.ravel(), color="blue")
draw_plane_ipv([0.,11.], [0.,11.], beta, 'black')
draw_plane_ipv([0.,11.], [0.,11.], beta_reg, 'red')
ipv.show()