
ML003 – MLE Estimation

Ever wondered how we find the best possible parameters for a model? Maximum Likelihood Estimation (MLE) is the gold standard for parameter estimation.

Estimation

When given a dataset, we often seek to model the distribution from which the data is generated. This involves two key steps:

  1. Model Selection: Choosing an appropriate distribution (or model) that best fits the data.

  2. Parameter Estimation: Estimating the parameters of the chosen distribution (for parametric models).

Any statistic used to estimate an unknown parameter, denoted as \(\boldsymbol{\theta}\), is called an estimator of \(\boldsymbol{\theta}\). An estimator \(\hat{\boldsymbol{\theta}}\) is a random variable, being a function of the data, and the value it takes on an observed dataset is the estimate of \(\boldsymbol{\theta}\). There are four primary techniques for parameter estimation:

  1. Unbiased estimators

  2. Maximum Likelihood Estimation (MLE): A method that estimates the parameters of a distribution by maximizing the likelihood of the observed data.

  3. Maximum A Posteriori estimation (MAP)

  4. Bayesian estimation

Unbiased Estimators

Let \((X_1, X_2, \dots, X_n)\) be \(n\) independent and identically distributed (i.i.d) random variables, where each \(X_i\) is drawn from a distribution \(F\) with expected value \(\mathrm{E}[X_i] = \mu\) and variance \(\mathrm{Var}(X_i) = \sigma^2\). Two common unbiased estimators are:

  1. Sample mean: \(\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i\) is an unbiased estimator of \(\mu\).

  2. Sample variance: \(S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2\) is an unbiased estimator of \(\sigma^2\).

These estimators can be applied to data from any distribution to estimate its mean and variance. For example, if we assume the data follows a Gaussian distribution, we can use these estimators to estimate its parameters. However, not all distributions have mean and variance as their parameters, making this approach limited in its general applicability to parameter estimation.
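As a quick illustration, here is a minimal sketch (assuming NumPy and simulated Gaussian data, both choices made for this example rather than taken from the text) of how these two estimators are computed in practice; note the \(n-1\) divisor (ddof=1) that makes the sample variance unbiased.

```python
import numpy as np

# Simulated data for illustration: n i.i.d. draws from a Gaussian
# with true mean mu = 2.0 and true standard deviation sigma = 1.5
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

# Sample mean: unbiased estimator of mu
x_bar = data.mean()

# Sample variance: unbiased estimator of sigma^2 (ddof=1 gives the n-1 divisor)
s_squared = data.var(ddof=1)

print(f"sample mean     = {x_bar:.3f}")      # close to 2.0
print(f"sample variance = {s_squared:.3f}")  # close to 1.5**2 = 2.25
```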

Likelihood of Data

Consider a sample of \(n\) independent and identically distributed (i.i.d.) random variables \((X_1, X_2, \dots, X_n)\). These random variables form a random vector, whose joint probability density function (PDF) or probability mass function (PMF) is \(f(X_1, X_2, \dots, X_n)\). Because the \(X_i\)'s are independent and share the same distribution, characterized by a parameter \(\theta\), the joint distribution factorizes as:

\[f(X_1, X_2, \dots, X_n \, | \, \theta) = \prod_{i=1}^{n} f(X_i \, | \, \theta)\]
  • For discrete distributions, \(f(X_i = x_i) = P(X_i = x_i)\) is the probability of observing \(x_i\).

  • For continuous distributions, \(f(X_i = x_i)\) is the value of the density evaluated at \(x_i\).

Observed Data Example:

Suppose each \(X_i\) was drawn from a distribution \(\texttt{Ber}(p)\) with an unknown parameter \(p\). Then the PMF of each \(X_i\) is \(f(X_i=x_i) = p^{x_i} (1-p)^{(1- x_i)}\), which means \(f(X_i =1) = p \text{ and } f(X_i =0) = 1-p\).

Let \(n=10\) and suppose we observe the data: \([0,0,1,1,1,1,1,1,1,1]\). For the given data, we get:

\[\begin{align*} f([0,0,1,1,1,1,1,1,1,1] | p) & = \prod_{i=1}^{n} f(X_i \, | \, p) \\ & = \prod_{i=1}^{n} p^{x_i} (1-p)^{(1- x_i)} \\ & = p^8 (1-p)^2 \end{align*}\]

The above function is called the likelihood function, denoted as \(L(p)\). For each parameter value \(p\), we compute the likelihood of the observed data, so the likelihood function is a function of \(p\) given that data. By plotting the likelihood for different values of \(p\), we can answer questions such as: "if \(p=0.4\), how likely is the observed data?"

How to interpret the likelihood?

For example,

\[L(p = 0.4) = 0.000236\]

This means that the likelihood of observing the given sample when \(p=0.4\) is 0.000236.

Warning
Likelihoods do not have to sum or integrate to 1, unlike probabilities. Instead, they help us compare how well different parameter values explain the observed data.
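To make this concrete, here is a small sketch (assuming NumPy; the grid of candidate values is chosen purely for illustration) that evaluates the likelihood \(L(p) = p^8 (1-p)^2\) of the observed sample at a few values of \(p\). It reproduces \(L(0.4) \approx 0.000236\) and already hints that larger values of \(p\) explain the data better.

```python
import numpy as np

data = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])  # observed sample, n = 10

def bernoulli_likelihood(p, x):
    """Likelihood of i.i.d. Bernoulli(p) data: prod_i p^x_i * (1-p)^(1-x_i)."""
    return np.prod(p ** x * (1 - p) ** (1 - x))

# Evaluate the likelihood on a grid of candidate parameter values
for p in [0.2, 0.4, 0.6, 0.8]:
    print(f"L(p = {p:.1f}) = {bernoulli_likelihood(p, data):.6f}")

# L(p = 0.4) = 0.000236, matching the value above;
# the grid already suggests that p = 0.8 explains the data best.
```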

MLE Estimation

Consider a sample of \(n\) i.i.d random variables \((X_1, X_2, \dots, X_n)\). Each \(X_i\) is drawn from a distribution with density (or mass) function \(f(x_i |\, \boldsymbol{\theta})\), where \(\boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_k)\). The PMF or PDF gives the likelihood of a single data point \(x_i\) given parameter \(\boldsymbol{\theta}\). Here we are given a set of observed i.i.d data points \((x_1, x_2, \dots, x_n)\) and we seek to answer:

How likely is the observed data \((x_1, x_2, \dots, x_n)\) given parameter \(\boldsymbol{\theta}\)?

The likelihood of all the observed data points is given by the likelihood function, denoted by \(L(\boldsymbol{\theta})\), which is defined as

\[L(\boldsymbol{\theta}) = f(x_1, x_2, \dots, x_n | \boldsymbol{\theta}) = \prod_{i=1}^{n} f(X_i=x_i | \boldsymbol{\theta})\]
Note
As the data points are i.i.d., we can multiply their individual probabilities (or densities).

The Maximum Likelihood Estimator (MLE) of \(\boldsymbol{\theta}\) is the value of \(\boldsymbol{\theta}\) that maximizes \(L(\boldsymbol{\theta})\), i.e., the value that makes the observed data look as likely as possible.

\[\hat{\boldsymbol{\theta}}_{\text{MLE}} = \underset{\boldsymbol{\theta}}{\mathrm{arg max}}\, L(\boldsymbol{\theta})\]

We know that \(\log\) is a strictly increasing function: for \(x_1, x_2 > 0\), \(x_1 < x_2 \Leftrightarrow \log x_1 < \log x_2\). Consequently, for any function \(f\) taking positive values, if \(f(x_1) < f(x_2)\), then \(\log f(x_1) < \log f(x_2)\).

Taking log (natural log) on both sides of the likelihood equation,

\[LL(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \log \left( \prod_{i=1}^{n} f(X_i=x_i | \boldsymbol{\theta}) \right) = \sum_{i=1}^n \log f(X_i=x_i | \boldsymbol{\theta})\]

Since log is monotonic, maximizing \(L(\boldsymbol{\theta})\) is equivalent to maximizing \(LL(\boldsymbol{\theta})\): the argument that maximizes \(L(\boldsymbol{\theta})\) is the same as the argument that maximizes \(LL(\boldsymbol{\theta})\). In practice we maximize the log-likelihood for numerical stability, since a product of many small probabilities can underflow while a sum of their logarithms does not.

\[\hat{\boldsymbol{\theta}}_{\text{MLE}} = \underset{\boldsymbol{\theta}}{\mathrm{arg max}}\, L(\boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\mathrm{arg max}}\, LL(\boldsymbol{\theta})\]
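Continuing the Bernoulli example from earlier, the log-likelihood of the observed sample \([0,0,1,1,1,1,1,1,1,1]\) is

\[LL(p) = \log \left( p^8 (1-p)^2 \right) = 8 \log p + 2 \log (1-p)\]

which is far easier to differentiate than the product form.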

Solving the Optimization Problem

There are various methods to solve this optimization problem, and the choice of method depends on the specific problem at hand. A simple approach, when feasible, is to set the derivative to zero and solve for the optimal value.

Since the log-likelihood function is often easier to differentiate than the likelihood function, we work with the log-likelihood instead.

To find the MLE, we differentiate \(LL(\boldsymbol{\theta})\) with respect to each parameter \(\theta_i\).

\[\frac{\partial}{\partial \theta_1} LL(\boldsymbol{\theta}), \frac{\partial}{\partial \theta_2} LL(\boldsymbol{\theta}), \dots , \frac{\partial}{\partial \theta_k} LL(\boldsymbol{\theta})\]

Setting each derivative to zero gives the system of equations:

\[\begin{align*} \frac{\partial}{\partial \theta_1} LL(\boldsymbol{\theta}) & = 0 \\ \frac{\partial}{\partial \theta_2} LL(\boldsymbol{\theta}) & = 0 \\ & \;\;\vdots \\ \frac{\partial}{\partial \theta_k} LL(\boldsymbol{\theta}) & = 0 \end{align*}\]

This results in \(k\) equations with \(k\) unknowns, which can be solved algebraically or computationally to obtain \(\hat{\theta}_{\text{MLE}}\) for each \(\theta_i\). Thus, we obtain the parameter values \(\boldsymbol{\theta}\) that maximize the likelihood of observing the given data.
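For the Bernoulli example, differentiating \(LL(p) = 8 \log p + 2 \log(1-p)\) and setting the result to zero gives

\[\frac{d}{dp} LL(p) = \frac{8}{p} - \frac{2}{1-p} = 0 \quad \Rightarrow \quad 8(1-p) = 2p \quad \Rightarrow \quad \hat{p}_{\text{MLE}} = 0.8\]

which is simply the fraction of ones in the observed sample.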

Caution

Setting derivatives to zero does not always guarantee a maximum. In some cases, additional checks (e.g., second derivative tests) may be required. Furthermore, when an analytical solution is difficult or impossible, we may need to use numerical optimization techniques, such as:

  • Gradient Descent

  • Newton-Raphson Method

  • Expectation-Maximization (EM) Algorithm (for latent variable models)

These iterative methods help find the MLE when closed-form solutions are intractable.
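As a minimal sketch of the numerical route (assuming SciPy is available; its bounded scalar optimizer stands in here for the iterative methods listed above), we can maximize the Bernoulli log-likelihood from the running example by minimizing its negative. For this simple problem the optimizer recovers the closed-form answer \(\hat{p} = 0.8\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

def neg_log_likelihood(p):
    """Negative Bernoulli log-likelihood; minimizing it maximizes LL(p)."""
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Bounded scalar minimization keeps p strictly inside (0, 1)
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"numerical MLE: p_hat = {result.x:.4f}")  # ~0.8000, matching the closed form
```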
