Linear Regression from scratch (analytical solution)

This short article shows how to estimate a multiple linear regression from scratch. We will use Ordinary Least Squares (OLS), the standard method for this problem.

The OLS estimate of the parameter vector $\beta$ is:

$\beta = (X^TX)^{-1} X^T\vec{y}$, where $X$ is the design matrix (one row per observation, one column per regressor) and $\vec{y}$ is the vector of observations.
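As a preview of the steps below, here is a minimal sketch of this formula in NumPy. The function name ols_coefficients is just for illustration, and np.linalg.solve is used instead of an explicit inverse because it is numerically more stable:

import numpy as np

def ols_coefficients(X, y):
    # Solve the normal equations (X^T X) beta = X^T y
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))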

Let’s start with the imports. numpy does the linear algebra, scipy.stats provides the distributions used for the p-values, and statsmodels is only used at the end for comparison. Using pandas is optional; you can run this tutorial without it.

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

import pandas as pd  # optional
We now generate dummy data.

x1, x2 and x3 are random variables we will use to build Y:

x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)
Y is defined as a linear combination of x1, x2 and x3, plus an intercept. The coefficients can be changed:

Y = 0.2*x1 + 2*x2 + 7*x3 + 1.2
Y = Y + np.random.randn(100)  # adding some noise
Let’s build our matrix X. We add a column of ones, np.ones(100), to account for the intercept. If you don’t want to fit the intercept, use [x1, x2, x3] instead:
FIT_INTERCEPT = True

if FIT_INTERCEPT:
    BigX = np.array([np.ones(100), x1, x2, x3]).T
else:
    BigX = np.array([x1, x2, x3]).T
BigX.shape
(100, 4)
pd.DataFrame(BigX).head(10)

We are now able to compute the coefficients thanks to the formula.

Let’s start with the first part of the equation, $X^TX$.
.T transposes a matrix and .dot() computes the product of two matrices:

xTx = BigX.T.dot(BigX)

np.linalg.inv is used to compute the inverse of the above expression, $(X^TX)^{-1}$:

XtX = np.linalg.inv(xTx)

And $(X^TX)^{-1} X^T$:

XtX_xT = XtX.dot(BigX.T)
XtX_xT.shape
(4, 100)

Finally, to get $\beta$, we multiply the complete expression by $\vec{y}$, giving $\beta = (X^TX)^{-1} X^T\vec{y}$:

theta = XtX_xT.dot(Y)
theta
array([1.18285547, 0.19657548, 2.08816915, 6.95718985])
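As an optional sanity check (an aside, not part of the original walkthrough), np.linalg.lstsq should give the same coefficients:

beta_lstsq, *_ = np.linalg.lstsq(BigX, Y, rcond=None)
np.allclose(theta, beta_lstsq)  # expected to be True up to floating-point error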

SSR, SSE and SSTO

To make the model easier to follow, we will use the following terminology:

Total (T): $y-\bar{y}$
Regression (Reg): $\hat{y}-\bar{y}$
Residual (Res): $y-\hat{y}$

From this model, we will derive SSR, SSE and SSTO. SSR relates to the regression, SSE to the residuals, and SSTO to the total.

Yest = theta[1] * x1 + theta[2] * x2 + theta[3] * x3 + theta[0]  # fitted values (equivalently BigX.dot(theta))
residuals = Y - Yest
Ybar = Y.mean()

$SSR$ is the “regression sum of squares”:

$$SSR = \sum_{i=1}^n(\hat{y}_{i}-\bar{y})^2$$

$\bar{y}$, the mean of $y$, represents the “no relationship” line.

SSR = ((Yest - Ybar)**2).sum()
SSR
3542.4265948501607

$SSE$ is the “error sum of squares” and quantifies how much the data points, $y_{i}$, vary around the estimated regression line, $\hat{y}_{i}$:
$SSE = \sum_{i=1}^n(y_{i}-\hat{y}_{i})^2$

SSE = ((Y - Yest)**2).sum()
SSE
92.51266827522522

$SSTO$ is the “total sum of squares” and quantifies how much the data points, $y_{i}$, vary around their mean, $\bar{y}$.
$SSTO = \sum_{i=1}^n(y_{i}-\bar{y})^2$

SSTO = ((Y - Ybar)**2).sum()
SSTO
3634.9392631253872

Note that $SSTO = SSE + SSR$ :

SSE + SSR
3634.939263125386
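To verify this identity programmatically rather than by eye (a quick aside, not in the original walkthrough):

np.allclose(SSTO, SSE + SSR)  # True up to floating-point error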

The $R^2$ illustrates how much variance our model is able to explain compared to the total variance. SSR is the variation explained by the model and SSTO is the total sum of squares. Hence:

$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$

r2 = SSR/SSTO
r2
0.9745490470188263

A regression that describes a strong relationship therefore has an SSR close to the SSTO.
SSTO describes the total variance of the data, SSR is the variance explained by the regression, and SSE is the random (or unexplained) variation.
$\frac{SSR}{SSTO}$ is the proportion of the variance explained by the model ($R^2$).

We can also compute the Multiple R, which is $\sqrt{R^2}$:

r = r2**(1/2) 
r
0.9871925075783478

MSR, MSE and MSTO

MS stands for Mean Squares. Each value is the corresponding Sum of Squares divided by its Degrees of Freedom. First, the degrees of freedom have to be deduced from the data:

Total (T): DF = $n-1$, SS = $\sum_{i=1}^n(y_{i}-\bar{y})^2$, MS = SS / DF
Regression (Reg): DF = $nReg$, SS = $\sum_{i=1}^n(\hat{y}_{i}-\bar{y})^2$, MS = SS / DF
Residual (Res): DF = $n-nReg-1$, SS = $\sum_{i=1}^n(y_{i}-\hat{y}_{i})^2$, MS = SS / DF

We get $nReg$, the number of predictors:

if FIT_INTERCEPT:
    nReg = BigX.shape[1] - 1
else:
    nReg = BigX.shape[1]
nReg
3
Degrees of Freedom

Let’s define the degrees of freedom:

obs = BigX.shape[0]
dfReg = nReg
dfRes =  obs - dfReg - 1
dfT = dfReg + dfRes
dfreedom = np.array([dfReg, dfRes, dfT])
print(dfreedom)
[ 3 96 99]
MSR = SSR / dfReg
MSE = SSE / dfRes
MSTO = SSTO / dfT
Adjusted R-Square

The adjusted R-Square is given by $1-(1-R^2)\,\frac{n - 1}{n - nReg - 1}$, with $n$ the number of observations.
As you add variables to a model, the classic R-Square can only increase. The adjusted R-Square allows us to compare the fit of models with different numbers of variables.

adj_r2 = 1-(1-r2) * (obs - 1) / (obs - dfReg -1)
adj_r2
0.9737537047381646
F-Stat and the associated probability

Finally, we can compute the F-statistic and its associated probability:

FStat = MSR / MSE
FStat
1225.32030638189
FSignificance = stats.f.sf(FStat, dfReg, dfRes)
FSignificance
2.3160988815694793e-76
Standard Errors

To compute the Standard Errors of the coefficients, we need $MS_{res}(X^TX)^{-1}$, the estimated covariance matrix of the coefficients; the standard errors are the square roots of its diagonal:

XtX
array([[ 1.03530285e-02, -1.22255499e-03, -1.57253899e-03,
         1.99307876e-05],
       [-1.22255499e-03,  1.22362943e-02,  1.37815484e-03,
         2.04673533e-03],
       [-1.57253899e-03,  1.37815484e-03,  9.93350559e-03,
         2.28207357e-03],
       [ 1.99307876e-05,  2.04673533e-03,  2.28207357e-03,
         1.43514233e-02]])
MSE * XtX
array([[ 9.97694056e-03, -1.17814400e-03, -1.51541436e-03,
         1.92067744e-05],
       [-1.17814400e-03,  1.17917942e-02,  1.32809148e-03,
         1.97238486e-03],
       [-1.51541436e-03,  1.32809148e-03,  9.57265737e-03,
         2.19917412e-03],
       [ 1.92067744e-05,  1.97238486e-03,  2.19917412e-03,
         1.38300882e-02]])

For example, the standard error of the intercept is:

(MSE * XtX)[0,0]**(1/2)
0.09988463624670003

We can get all Standard Errors:

se = np.diagonal(MSE * XtX)**(1/2)
se
array([0.09988464, 0.10859003, 0.09783996, 0.1176014 ])
t Stat

t statistics are found by dividing the coefficients by their standard errors:

tStat = theta/se
tStat
array([11.84221635,  1.81025355, 21.34270291, 59.15907573])
The p-values (two-sided, from the Student t distribution with dfRes degrees of freedom):
pval = stats.t.sf(np.abs(tStat), dfRes)*2
pval
array([1.77074072e-20, 7.33849971e-02, 3.19662627e-38, 2.43379392e-77])
Lower / upper bounds of the 95% confidence intervals:
lower95 = theta - stats.t.isf(0.025,dfRes)*se
upper95 = theta + stats.t.isf(0.025,dfRes)*se
lower95
array([ 0.98458604, -0.01897402,  1.89395837,  6.72375292])
upper95
array([1.38112491, 0.41212498, 2.28237994, 7.19062677])
Comparing with the statsmodels package
results = sm.OLS(Y, BigX).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     1225.
Date:                Mon, 22 Jan 2018   Prob (F-statistic):           2.32e-76
Time:                        12:47:08   Log-Likelihood:                -138.00
No. Observations:                 100   AIC:                             284.0
Df Residuals:                      96   BIC:                             294.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.1829      0.100     11.842      0.000       0.985       1.381
x1             0.1966      0.109      1.810      0.073      -0.019       0.412
x2             2.0882      0.098     21.343      0.000       1.894       2.282
x3             6.9572      0.118     59.159      0.000       6.724       7.191
==============================================================================
Omnibus:                        0.798   Durbin-Watson:                   1.705
Prob(Omnibus):                  0.671   Jarque-Bera (JB):                0.897
Skew:                          -0.128   Prob(JB):                        0.639
Kurtosis:                       2.614   Cond. No.                         1.45
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can check that our values match the statsmodels summary.
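Beyond reading the table, we can also compare the values programmatically (an aside; params, bse, tvalues, pvalues, rsquared, rsquared_adj and fvalue are standard attributes of the statsmodels results object):

np.allclose(theta, results.params)        # coefficients
np.allclose(se, results.bse)              # standard errors
np.allclose(tStat, results.tvalues)       # t statistics
np.allclose(pval, results.pvalues)        # p-values
np.isclose(r2, results.rsquared)          # R-squared
np.isclose(adj_r2, results.rsquared_adj)  # adjusted R-squared
np.isclose(FStat, results.fvalue)         # F-statistic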