How do we find the best-fit line through a cloud of data points? The Least Squares Method holds the key! By minimizing errors, this simple yet powerful technique makes Linear Regression the go-to tool for predictions.
Problem Formulation
Given training data $\{(\mathbf{x}_n, t_n)\}$ for $n = 1, 2, \dots, N$, where $\mathbf{x}_n \in \mathbb{R}^D$ and $t_n \in \mathbb{R}$.
- $\mathbf{x}_n$ represents the $n$th training case, a vector containing the values $\{x_{n1}, x_{n2}, \dots, x_{nD}\}$.
- $t_n$ is the true label corresponding to the $n$th training case.
Our goal is to learn a function $f : \mathbf{x}_n \to t_n$ that maps these inputs to outputs.
A simple assumption is that each $t_n$ can be expressed as a weighted combination of the predictor variables (after proper scaling) plus a bias term $w_0$, which shifts the output to the range of $t_n$. For $N$ observations, this results in a system of linear equations:

$$t_n \approx w_0 + w_1 x_{n1} + w_2 x_{n2} + \dots + w_D x_{nD}, \qquad n = 1, 2, \dots, N.$$
A straightforward approach to finding the coefficients $w_j$ would be to solve this system of linear equations directly. In practice, however, an exact solution rarely exists: it would require all observations to lie exactly on a straight line (or hyperplane), which never happens because of noise and measurement error. Consequently, we must adopt an alternative approach.
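As a quick illustration, here is a minimal NumPy sketch with made-up numbers: three non-collinear points already make the system $t_n = w_0 + w_1 x_n$ inconsistent, and `np.linalg.lstsq` reports the leftover squared error of the best compromise.

```python
import numpy as np

# Three observations of a single predictor x and target t (illustrative values).
x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 2.5, 3.2])

# System of linear equations t_n = w0 + w1 * x_n, written as A w = t.
A = np.column_stack([np.ones_like(x), x])   # shape (3, 2)

# With 3 equations and 2 unknowns, an exact solution exists only if the
# points are collinear. np.linalg.lstsq returns the leftover squared error.
w, residual, rank, _ = np.linalg.lstsq(A, t, rcond=None)
print(w)         # best-fit intercept and slope
print(residual)  # non-zero => no exact solution: the points are not collinear
```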
Least Squares Estimate
We aim to find a line (or hyperplane) that is as close as possible to all data points. This means learning the coefficients in a way that minimizes the distance between the predicted and actual values.
Consider a linear function (linear in the parameters): $y_n = w_0 x_{n0} + w_1 x_{n1} + \dots + w_D x_{nD} = \mathbf{w}^\top \mathbf{x}_n$, where $x_{n0} = 1$ for all $n$ and $y_n$ is the predicted value of our model under some weight vector $\mathbf{w}$. Our goal is to make these predictions as close as possible to the actual values.
To achieve this, we learn $\mathbf{w} \in \mathbb{R}^{D+1}$ by minimizing the sum of squared errors between $t_n$ and $y_n$:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl(t_n - y_n\bigr)^2 = \frac{1}{2} \sum_{n=1}^{N} \bigl(t_n - \mathbf{w}^\top \mathbf{x}_n\bigr)^2.$$

The error $E$, a function of $\mathbf{w}$, is known as the squared error. Dividing it by $N$ would give the mean squared error, but for mathematical convenience we divide by 2 instead, since that factor cancels when we differentiate.
We need the $\mathbf{w}$ for which this error function is at its minimum. This approach to finding the optimal $\mathbf{w}$ is known as the least squares method.
To optimize this, we find the gradient of the objective function $E(\mathbf{w})$ and equate it to zero. Writing $\mathbf{X} \in \mathbb{R}^{N \times (D+1)}$ for the matrix whose $n$th row is $\mathbf{x}_n^\top$ and $\mathbf{t} = (t_1, \dots, t_N)^\top$, we have $E(\mathbf{w}) = \tfrac{1}{2}\lVert \mathbf{t} - \mathbf{X}\mathbf{w} \rVert^2$, so

$$\nabla E(\mathbf{w}) = -\mathbf{X}^\top(\mathbf{t} - \mathbf{X}\mathbf{w}) = 0 \;\Rightarrow\; \mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}.$$
Caution
In the above derivation, we assume that $\mathbf{X}^\top \mathbf{X}$ is invertible.
For linear regression, we therefore have a closed-form expression for the optimal $\mathbf{w}$ as an analytic function of $\mathbf{X}$ and $\mathbf{t}$.
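As a minimal sketch of this closed-form solution (assuming NumPy and a synthetic dataset for the check; the function name `fit_least_squares` is just illustrative), the weights can be obtained by solving the normal equations $(\mathbf{X}^\top\mathbf{X})\mathbf{w} = \mathbf{X}^\top\mathbf{t}$ directly rather than forming the inverse explicitly.

```python
import numpy as np

def fit_least_squares(X, t):
    """Closed-form least-squares weights.

    X : (N, D) matrix of inputs, t : (N,) vector of targets.
    A column of ones is prepended for the bias term w0, and the normal
    equations (X^T X) w = X^T t are solved. Assumes X^T X is invertible.
    """
    Xb = np.column_stack([np.ones(len(X)), X])   # (N, D+1) with x_{n0} = 1
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ t)  # w = (X^T X)^{-1} X^T t

# Tiny synthetic check (illustrative numbers only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([0.5, 2.0, -1.0, 3.0])         # [w0, w1, w2, w3]
t = true_w[0] + X @ true_w[1:] + 0.1 * rng.normal(size=100)
print(fit_least_squares(X, t))                   # close to true_w
```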
Linear Basis Function Models
The simplest linear model for regression is one that involves a linear combination of the input variables (which we saw above):

$$y(\mathbf{x}_n, \mathbf{w}) = w_0 x_{n0} + w_1 x_{n1} + \dots + w_D x_{nD} = \mathbf{w}^\top \mathbf{x}_n,$$
where $x_{n0} = 1$. This model is often simply known as a linear regression model. Its key property is that it is a linear function of the parameters $w_0, w_1, \dots, w_D$. It is, however, also a linear function of the input variables $x_0, x_1, \dots, x_D$, and this imposes significant limitations on the model. We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \qquad \phi_0(\mathbf{x}) = 1,$$
where $\phi_j(\mathbf{x}) : \mathbb{R}^D \to \mathbb{R}$ are known as basis functions and $\boldsymbol{\phi}(\mathbf{x}) = \bigl(\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x})\bigr)^\top$. Note that the total number of parameters in this model is $M$. In matrix notation, over all $N$ training cases, the model can be written as

$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{w},$$

where

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix} \in \mathbb{R}^{N \times M}$$

is the design matrix, $\mathbf{w} = (w_0, \dots, w_{M-1})^\top$, and $\mathbf{y} = (y_1, \dots, y_N)^\top$.
By using nonlinear basis functions, we allow the function $y(\mathbf{x}, \mathbf{w})$ to be a nonlinear function of the input vector $\mathbf{x}$. Functions of the above form are nevertheless called linear models, because they are linear in the parameters $\mathbf{w}$.
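To make this concrete, here is a small sketch (assuming NumPy; the Gaussian basis and its centres and width are illustrative choices, not prescribed above) that maps scalar inputs to a feature vector $\boldsymbol{\phi}(x)$. However nonlinear the basis functions are, the resulting model stays linear in $\mathbf{w}$.

```python
import numpy as np

def gaussian_basis(x, centres, width):
    """Map scalar inputs x (shape (N,)) to features phi(x) (shape (N, M)).

    phi_0(x) = 1, and phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for j >= 1.
    The centres mu_j and the width s are fixed design choices, not learned.
    """
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))
    return np.column_stack([np.ones(len(x)), phi])

x = np.linspace(0, 1, 5)
Phi = gaussian_basis(x, centres=np.linspace(0, 1, 4), width=0.2)
print(Phi.shape)   # (5, 5): M = 5 parameters, and the model is still linear in w
```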
Example: Polynomial Regression
Say we are given a dataset with a single feature and a target, $\{(x_n, t_n)\}$ for $n = 1, 2, \dots, N$, where $x_n, t_n \in \mathbb{R}$, and we see that $t$ and $x$ are related in a nonlinear manner. If we assume that the underlying function is an $M$th-order polynomial, we can write the function of one variable as:

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j.$$
Here the basis functions are $\phi_0(x) = 1$, $\phi_1(x) = x$, $\phi_2(x) = x^2$, $\dots$, $\phi_M(x) = x^M$, so each input $x \in \mathbb{R}$ is mapped to an $(M+1)$-dimensional space $\mathbb{R}^{M+1}$. Considering all the training cases $n = 1$ to $N$, we can write the design matrix $\boldsymbol{\Phi}$ as:

$$\boldsymbol{\Phi} = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix} \in \mathbb{R}^{N \times (M+1)}.$$
The squared error is:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl(t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n)\bigr)^2 = \frac{1}{2} \lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{w} \rVert^2.$$
Then our optimal $\mathbf{w}$ is

$$\mathbf{w}^* = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}.$$
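Putting the pieces together, here is a short sketch (assuming NumPy; the sinusoidal data and the choice $M = 3$ are purely illustrative) that builds the polynomial design matrix with `np.vander` and recovers $\mathbf{w}^*$ from the normal equations.

```python
import numpy as np

# Noisy samples of a nonlinear function (synthetic data for illustration).
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)

M = 3  # polynomial order
# Design matrix Phi with rows [1, x_n, x_n^2, ..., x_n^M].
Phi = np.vander(x, M + 1, increasing=True)       # shape (N, M+1)

# Optimal weights from the normal equations: w = (Phi^T Phi)^{-1} Phi^T t.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

y = Phi @ w                                      # fitted values
print("weights:", w)
print("squared error:", 0.5 * np.sum((t - y) ** 2))
```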
Geometric Interpretation of Least Squares
In general, we have $M$ features (the number of columns of $\boldsymbol{\Phi}$), each of which is represented through a basis function $\phi_j(\mathbf{x})$.
Given these features, we take a linear combination of them to approximate the output: we want $\boldsymbol{\Phi}\mathbf{w} \approx \mathbf{t}$.
In general we cannot find a $\mathbf{w}$ that satisfies all $N$ equations exactly, so we choose the coefficients that minimize the difference between $\boldsymbol{\Phi}\mathbf{w}$ and $\mathbf{t}$, and we end up learning our coefficients as:

$$\mathbf{w} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}.$$
Substituting this value into the model $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}$, we get:

$$\mathbf{y} = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}.$$
We have actually applied a matrix transformation $\boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top$ to $\mathbf{t}$ to get our predicted output.
The least squares regression function is obtained by orthogonally projecting the $N$-dimensional output vector $\mathbf{t}$ onto the subspace spanned by the columns of $\boldsymbol{\Phi}$, i.e. by the $N$-dimensional vectors $\bigl(\phi_j(\mathbf{x}_1), \dots, \phi_j(\mathbf{x}_N)\bigr)^\top$. Since there are only $M$ such vectors and typically $M < N$, they cannot span the whole space; they span only a subspace.
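A small numerical check (assuming NumPy and a random full-rank design matrix, purely for illustration) confirms that $\boldsymbol{\Phi}(\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top$ behaves as an orthogonal projection: it is idempotent, symmetric, and leaves a residual orthogonal to the columns of $\boldsymbol{\Phi}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 50, 4
Phi = rng.normal(size=(N, M))              # any full-rank design matrix
t = rng.normal(size=N)

# Hat (projection) matrix P = Phi (Phi^T Phi)^{-1} Phi^T.
P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
y = P @ t                                  # predictions = projection of t

# Properties of an orthogonal projection:
print(np.allclose(P @ P, P))               # idempotent: P^2 = P
print(np.allclose(P, P.T))                 # symmetric
print(np.allclose(Phi.T @ (t - y), 0))     # residual orthogonal to the subspace
                                           # spanned by the columns of Phi
```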

References
- Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer (2006).
- Kevin P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press (2012); also see the upcoming editions at https://probml.github.io/pml-book/