
ML001 – Linear Regression – Least Squares Fit

How do we find the best-fit line through a cloud of data points? The Least Squares Method holds the key! By minimizing errors, this simple yet powerful technique makes Linear Regression the go-to tool for predictions.

Problem Formulation

Given training data $\{(\mathbf{x}_n, t_n)\}$ for $n = 1, 2, \ldots, N$, where $\mathbf{x}_n \in \mathbb{R}^D$ and $t_n \in \mathbb{R}$.

  • $\mathbf{x}_n$ represents the $n$th training case, a vector containing the values $\{x_{n1}, x_{n2}, x_{n3}, \ldots, x_{nD}\}$.

  • $t_n$ is the true label corresponding to the $n$th training case.

Our goal is to learn a function $f : \mathbf{x}_n \mapsto t_n$ that maps these inputs to outputs.

A simple assumption is that each $t_n$ can be expressed as a weighted combination of the predictor variables (after proper scaling) plus a bias term $w_0$. The bias term is added to adjust the output to the range of $t_n$. For $N$ observations, this results in a system of linear equations:

$$\begin{aligned}
t_1 &= w_1 x_{11} + \cdots + w_D x_{1D} + w_0 \\
t_2 &= w_1 x_{21} + \cdots + w_D x_{2D} + w_0 \\
&\;\;\vdots \\
t_N &= w_1 x_{N1} + \cdots + w_D x_{ND} + w_0
\end{aligned}$$

A straightforward approach to finding the coefficients $w_j$ would be to solve this system of linear equations exactly. In practice, however, an exact solution rarely exists: with $N > D + 1$ observations the system is overdetermined, and because of measurement noise the points almost never lie exactly on a single line (or hyperplane), as the quick numerical sketch below illustrates. Consequently, we must adopt an alternative approach.
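A minimal NumPy illustration (the data, noise level, and variable names here are made up for the example): with $N = 5$ noisy points and only two unknowns $(w_1, w_0)$, the stacked system has no exact solution, which shows up as a strictly positive least-squares residual.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
x = rng.uniform(0.0, 10.0, size=N)
t = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=N)   # hypothetical "true" line plus measurement noise

# N equations t_n = w_1 x_n + w_0 in only 2 unknowns: stack as t = A @ [w_1, w_0].
A = np.column_stack([x, np.ones(N)])                # shape (N, 2)

w, residual, rank, _ = np.linalg.lstsq(A, t, rcond=None)
print("best-fit (w_1, w_0):", w)
print("sum of squared residuals:", residual)        # strictly positive, so no exact solution exists
```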

Least Squares Estimate

We aim to find a line (or hyperplane) that is as close as possible to all data points. This means learning the coefficients in a way that minimizes the distance between the predicted and actual values.

Consider a linear function (linear in the parameters): $y_n = w_0 x_{n0} + w_1 x_{n1} + \cdots + w_D x_{nD} = \mathbf{w}^\top \mathbf{x}_n$, where $x_{n0} = 1$ for all $n$ and $y_n$ is the predicted value of our model using some weight vector $\mathbf{w}$. Our goal is to make these predictions as close as possible to the actual values.

To achieve this, we learn $\mathbf{w} \in \mathbb{R}^{D+1}$ by minimizing the sum of squared errors between $t_n$ and $y_n$. This error $E$, a function of $\mathbf{w}$, is known as the squared error; dividing it by $N$ would give the mean squared error, but for mathematical convenience we divide it by $2$ instead.

$$\begin{aligned}
E(\mathbf{w}) &= \frac{1}{2}\sum_{n=1}^{N}\left(t_n - y_n\right)^2 = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{x}_n^\top\mathbf{w}\right)^2 \\
&= \frac{1}{2}\left(\mathbf{t} - X\mathbf{w}\right)^\top\left(\mathbf{t} - X\mathbf{w}\right) \\
&= \frac{1}{2}\left(\mathbf{t}^\top - \mathbf{w}^\top X^\top\right)\left(\mathbf{t} - X\mathbf{w}\right) \\
&= \frac{1}{2}\left(\mathbf{t}^\top\mathbf{t} - \mathbf{t}^\top X\mathbf{w} - \mathbf{w}^\top X^\top\mathbf{t} + \mathbf{w}^\top X^\top X\mathbf{w}\right) \\
&= \frac{1}{2}\left(\mathbf{t}^\top\mathbf{t} - 2\,\mathbf{t}^\top X\mathbf{w} + \mathbf{w}^\top X^\top X\mathbf{w}\right)
\end{aligned}$$

where $X \in \mathbb{R}^{N \times (D+1)}$ is the matrix whose $n$th row is $\mathbf{x}_n^\top$ and $\mathbf{t} = [t_1, \ldots, t_N]^\top$.
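As a quick sanity check on the matrix algebra (not part of the derivation itself), the short NumPy sketch below evaluates the elementwise sum of squared errors and the expanded quadratic form on random data and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 8, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # first column is x_n0 = 1
t = rng.normal(size=N)
w = rng.normal(size=D + 1)          # an arbitrary weight vector, not the optimum

E_sum = 0.5 * np.sum((t - X @ w) ** 2)                      # elementwise definition
E_quad = 0.5 * (t @ t - 2 * t @ X @ w + w @ X.T @ X @ w)    # expanded quadratic form
print(np.isclose(E_sum, E_quad))                            # True
```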

We need $\mathbf{w}$ such that this error function is minimized. This approach of finding the optimal $\mathbf{w}$ is known as the least squares method.

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$$

To optimize this, we can find the gradient of the objective function $E(\mathbf{w})$ and set it equal to $0$.

$$\begin{aligned}
\nabla E(\mathbf{w}) = \frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} &= 0 \\
\frac{1}{2}\left(-2\,\mathbf{t}^\top X + \mathbf{w}^\top X^\top X + \left(X^\top X\mathbf{w}\right)^{\top}\right) &= 0 \\
-\mathbf{t}^\top X + \mathbf{w}^\top X^\top X &= 0 \\
\mathbf{w}^\top X^\top X &= \mathbf{t}^\top X \\
\mathbf{w}^\top &= \mathbf{t}^\top X\left(X^\top X\right)^{-1} \\
\mathbf{w}^* &= \left(X^\top X\right)^{-1} X^\top \mathbf{t}
\end{aligned}$$
Caution
In the above derivation, we assume that $X^\top X$ is invertible.

For linear regression, we thus have a closed-form expression for the optimal $\mathbf{w}$ as an analytic function of $X$ and $\mathbf{t}$.
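A minimal sketch of this closed-form solution on synthetic data follows; rather than forming the inverse explicitly, it solves the normal equations $X^\top X\mathbf{w} = X^\top\mathbf{t}$ and then checks that the gradient vanishes at $\mathbf{w}^*$ and that the result matches np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 100, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # bias column x_n0 = 1
w_true = rng.normal(size=D + 1)                             # hypothetical generating weights
t = X @ w_true + rng.normal(scale=0.1, size=N)              # noisy targets

# Closed form via the normal equations X^T X w = X^T t (no explicit inverse).
w_star = np.linalg.solve(X.T @ X, X.T @ t)

gradient = X.T @ X @ w_star - X.T @ t                       # gradient of E(w) at w*
print("max |gradient| at w*:", np.abs(gradient).max())      # ~ 0
print("matches np.linalg.lstsq:",
      np.allclose(w_star, np.linalg.lstsq(X, t, rcond=None)[0]))
```

Solving the normal equations (or calling np.linalg.lstsq, which uses an SVD-based routine) is generally preferred over computing $(X^\top X)^{-1}$ explicitly, since it is cheaper and numerically more stable.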

Linear Basis Function Models

The simplest linear model for regression is one that involves a linear combination of the input variables (which we saw above)

$$y_n(\mathbf{x}_n, \mathbf{w}) = w_0 x_{n0} + w_1 x_{n1} + \cdots + w_D x_{nD} = \mathbf{w}^\top\mathbf{x}_n$$

where $x_{n0} = 1$. This model is often simply known as a linear regression model. The key property of this model is that it is a linear function of the parameters $w_0, w_1, \ldots, w_D$. It is also, however, a linear function of the input variables $x_1, \ldots, x_D$, and this imposes significant limitations on the model. We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

$$y_n(\mathbf{x}_n, \mathbf{w}) = w_0\phi_0(\mathbf{x}_n) + w_1\phi_1(\mathbf{x}_n) + \cdots + w_{M-1}\phi_{M-1}(\mathbf{x}_n) = w_0\phi_0(\mathbf{x}_n) + \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}_n) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}_n)$$

where $\phi_j : \mathbb{R}^D \to \mathbb{R}$ are known as basis functions. Note that the total number of parameters in this model is $M$. In matrix notation, the model can be written as

$$y_n(\mathbf{x}_n, \mathbf{w}) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}_n) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n)$$

Where

$$\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{bmatrix} \in \mathbb{R}^{M}
\quad\text{and}\quad
\boldsymbol{\phi}(\mathbf{x}) = \begin{bmatrix} \phi_0(\mathbf{x}) \\ \phi_1(\mathbf{x}) \\ \vdots \\ \phi_{M-1}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^{M}$$

By using nonlinear basis functions, we allow the function $y(\mathbf{x}, \mathbf{w})$ to be a nonlinear function of the input vector $\mathbf{x}$. Functions of this form are nevertheless still called linear models, because they are linear in the parameters $\mathbf{w}$.
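To make this concrete, here is a small sketch using one possible nonlinear basis, Gaussian bumps $\phi_j(x) = \exp\!\left(-(x - \mu_j)^2 / 2s^2\right)$ together with $\phi_0(x) = 1$; the centres $\mu_j$, the width $s$, and the sinusoidal target are assumptions made purely for illustration. The resulting model is nonlinear in $x$ but is still fit by exactly the same least-squares machinery, because it is linear in $\mathbf{w}$.

```python
import numpy as np

def design_matrix(x, centres, s=0.3):
    """Phi with a bias column phi_0(x) = 1 followed by one Gaussian bump per centre."""
    bias = np.ones((x.shape[0], 1))
    bumps = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([bias, bumps])                  # shape (N, M) with M = len(centres) + 1

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, size=50))
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)  # nonlinear target (assumed)

centres = np.linspace(0.0, 1.0, 9)                   # assumed basis-function centres mu_j
Phi = design_matrix(x, centres)
w = np.linalg.lstsq(Phi, t, rcond=None)[0]           # same least-squares fit as before
print("training MSE:", np.mean((t - Phi @ w) ** 2))
```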

Example: Polynomial Regression

Say we are given a dataset with a single feature and a target, $\{(x_n, t_n)\}$ for $n = 1, 2, \ldots, N$, where $x_n, t_n \in \mathbb{R}$, and we observe that $t$ and $x$ are related in a nonlinear manner. If we assume that the underlying function is an $M$th-order polynomial, then we can write the function of one variable as:

$$y_n(x_n, \mathbf{w}) = \sum_{j=0}^{M} w_j\phi_j(x_n) = \sum_{j=0}^{M} w_j x_n^{\,j}$$

Here the basis functions are $\phi_0(x) = 1, \phi_1(x) = x, \phi_2(x) = x^2, \ldots, \phi_M(x) = x^M$, so each input $x \in \mathbb{R}$ is mapped to an $(M+1)$-dimensional space $\mathbb{R}^{M+1}$. Considering all the training cases $n = 1$ to $N$, we can write the design matrix $\Phi$ as:

$$\Phi := \begin{bmatrix} \boldsymbol{\phi}(x_1)^\top \\ \boldsymbol{\phi}(x_2)^\top \\ \vdots \\ \boldsymbol{\phi}(x_N)^\top \end{bmatrix}
= \begin{bmatrix}
\phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N)
\end{bmatrix} \in \mathbb{R}^{N \times (M+1)}$$

The squared error is:

$$E(\mathbf{w}) = \sum_{n=1}^{N}\left(t_n - y_n\right)^2 = \sum_{n=1}^{N}\left(t_n - \mathbf{w}^\top\boldsymbol{\phi}(x_n)\right)^2 = \left(\mathbf{t} - \Phi\mathbf{w}\right)^\top\left(\mathbf{t} - \Phi\mathbf{w}\right)$$

Then our optimal $\mathbf{w}$ is

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w}) = \arg\min_{\mathbf{w}} \left(\mathbf{t} - \Phi\mathbf{w}\right)^\top\left(\mathbf{t} - \Phi\mathbf{w}\right) = \left(\Phi^\top\Phi\right)^{-1}\Phi^\top\mathbf{t}$$
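The sketch below carries this out on synthetic data (the degree $M = 3$ and the cubic ground-truth curve are assumptions chosen for illustration): it builds the design matrix $\Phi$ with np.vander and solves the normal equations for $\mathbf{w}^*$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 30, 3
x = np.sort(rng.uniform(-1.0, 1.0, size=N))
t = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.05, size=N)  # assumed cubic target + noise

Phi = np.vander(x, M + 1, increasing=True)          # columns x^0, x^1, ..., x^M; shape (N, M+1)
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)    # solves (Phi^T Phi) w = Phi^T t
print("fitted coefficients w_0 ... w_M:", w_star)

x_new = np.array([0.25])
print("prediction at x = 0.25:",
      np.vander(x_new, M + 1, increasing=True) @ w_star)
```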

Geometric Interpretation of Least Squares

In general, we have $M$ features (the number of columns), each of which can be represented through a basis function $\phi_j(\mathbf{x})$.

$$\Phi := \begin{bmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)
\end{bmatrix} \in \mathbb{R}^{N \times M}$$

Given these features, we take a linear combination of them to approximate our output $\mathbf{t}$.

$$\underbrace{\mathbf{y}}_{N \times 1} = \underbrace{\Phi}_{N \times M}\,\underbrace{\mathbf{w}}_{M \times 1}$$

We cannot, in general, find a $\mathbf{w}$ that satisfies all $N$ equations exactly, so we choose the coefficients so as to minimize the difference between $\Phi\mathbf{w}$ and $\mathbf{t}$. We end up learning our coefficients as:

$$\mathbf{w}^* = \left(\Phi^\top\Phi\right)^{-1}\Phi^\top\mathbf{t}$$

Substituting this value into the previous equation, we get:

$$\mathbf{y} = \Phi\left(\Phi^\top\Phi\right)^{-1}\Phi^\top\mathbf{t}$$

We have effectively applied a matrix transformation $\Phi\left(\Phi^\top\Phi\right)^{-1}\Phi^\top$ (often called the projection, or "hat", matrix) to $\mathbf{t}$ to obtain our predicted output.

The least squares regression function is obtained by finding the orthogonal projection of the $N$-dimensional target vector $\mathbf{t}$ onto the subspace spanned by the basis-function vectors, i.e. the columns of $\Phi$, each of which is an $N$-dimensional vector. Since there are only $M$ such vectors and typically $M < N$, they cannot span the whole space; they span only a subspace.

[Figure: geometric view of the least squares solution as an orthogonal projection of $\mathbf{t}$ onto the subspace spanned by the columns of $\Phi$]
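The short numerical check below (with a random full-rank $\Phi$ and random $\mathbf{t}$, chosen only for illustration) confirms the projection view: the residual $\mathbf{t} - \mathbf{y}$ is orthogonal to every column of $\Phi$, and the hat matrix is idempotent, as any orthogonal projection matrix must be.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 20, 4
Phi = rng.normal(size=(N, M))     # a random design matrix, assumed to have full column rank
t = rng.normal(size=N)

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T          # projection ("hat") matrix, N x N
y = P @ t                                             # predicted output = projection of t

print("residual orthogonal to columns of Phi:", np.allclose(Phi.T @ (t - y), 0.0))
print("projection is idempotent (P @ P = P):", np.allclose(P @ P, P))
```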

References

  1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer (2006).

  2. Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press (2012); also see the newer editions at https://probml.github.io/pml-book/
