Density Estimation: MLE and MAP

Density Estimation

Density estimation is a statistical technique used to estimate the probability density function of a random variable from a set of observed data points. It helps in understanding the underlying distribution of the data.

Maximum Likelihood Estimation (MLE) is a method used to find the parameters of a statistical model that maximize the likelihood function, which measures how well the model explains the observed data. In the context of density estimation, MLE aims to find the parameters that make the observed data most probable.

Maximum A Posteriori Estimation (MAP), on the other hand, incorporates prior knowledge about the parameters by using a prior distribution. It combines this prior information with the likelihood function to obtain a posterior distribution. In the context of density estimation, MAP seeks to find the parameters that maximize the posterior probability given both the observed data and prior information.

In summary, the key difference lies in the incorporation of prior information in MAP, making it a Bayesian approach, while MLE purely maximizes the likelihood of the observed data without considering prior knowledge.

Typically, estimating the entire distribution is intractable; instead, we are content with a point estimate from the distribution, such as its mean or mode. Maximum a Posteriori, or MAP for short, is a Bayesian approach to estimating a distribution and model parameters that best explain an observed dataset.

For example, suppose we have a sample of observations ($X = x_1, x_2, x_3, \ldots, x_n$) from a domain, where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explains the joint probability distribution of the observed data ($X$).

Often estimating the density is too challenging; instead, we are happy with a point estimate from the target distribution, such as the mean.

There are many techniques for solving this problem, although two common approaches are:
Maximum A Posteriori (MAP), a Bayesian method.
Maximum Likelihood Estimation (MLE), a frequentist method.

MLE and MAP are methods of estimating the parameters of statistical models.

Both approaches frame the problem as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

$P(X | \theta)$
or
$P(x_1, x_2, x_3,\ldots, x_n | \theta)$

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters.

The objective of Maximum Likelihood Estimation is to find the set of parameters ($\theta$) that maximize the likelihood function, e.g. result in the largest likelihood value.

maximize $P(X | \theta)$

An alternative and closely related approach is to consider the optimization problem from the perspective of Bayesian probability.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

$P(A | B) = (P(B | A) * P(A)) / P(B)$

The quantity that we are calculating is typically referred to as the posterior probability of $A$ given $B$ and $P(A)$ is referred to as the prior probability of $A$.

The normalizing constant of $P(B)$ can be removed, and the posterior can be shown to be proportional to the probability of $B$ given $A$ multiplied by the prior.
$P(A | B)$ is proportional to $P(B | A) * P(A)$

Or, simply:
$P(A | B) \propto P(B | A) \times P(A)$
This is a helpful simplification, as we are not interested in estimating a probability but in optimizing a quantity; a proportional quantity is good enough for this purpose.
We can now relate this calculation to our desire to estimate a distribution and parameters ($\theta$) that best explains our dataset ($X$), as we described in the previous section. This can be stated as:

$P(\theta | X) \propto P(X | \theta) \times P(\theta)$

Maximizing this quantity over a range of $\theta$ solves an optimization problem for estimating the central tendency of the posterior probability (i.e., the mode of the distribution). As such, this technique is referred to as “maximum a posteriori estimation,” or MAP estimation for short, and sometimes simply “maximum posterior estimation.”

maximize $P(X | \theta) \times P(\theta)$

We are typically not calculating the full posterior probability distribution, and in fact, this may not be tractable for many problems of interest.
Finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem.

Instead, we are calculating a point estimate such as the mode, the most common value, which coincides with the mean for the normal distribution.
One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation.

Note: this is very similar to Maximum Likelihood Estimation, with the addition of the prior probability over the distribution and parameters.

In fact, if we assume that all values of $\theta$ are equally likely because we don’t have any prior information (e.g. a uniform prior), then both calculations are equivalent.
Because of this equivalence, both MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the calculation of the MLE and MAP optimization problem differ, the MLE and MAP solution found for an algorithm may also differ.
The maximum likelihood hypothesis might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

Maximum Likelihood Estimation (MLE)
Let us say we have an independent and identically distributed (iid) sample
$X = \{x^t\}_{t=1}^N$

We assume that $x^t$ are instances drawn from some known probability density family, $p(x|\theta)$, defined up to parameters, $\theta$:

$x^t \sim p(x|\theta)$

We want to find $\theta$ that makes sampling $x^t$ from $p(x|\theta)$ as likely as possible. Because $x^t$ are independent, the likelihood of parameter $\theta$ given sample $X$ is the product of the likelihoods of the individual points:

$l(\theta|X)=p(X|\theta)=\prod_{t=1}^N p(x^t| \theta)$

In maximum likelihood estimation, we are interested in finding the $\theta$ that makes $X$ the most likely to be drawn. We thus search for the $\theta$ that maximizes the likelihood, which we denote by $l(\theta|X)$. We can maximize the log of the likelihood without changing the value where it takes its maximum. $\log(\cdot)$ converts the product into a sum and leads to further computational simplification when certain densities are assumed, for example, those containing exponents. The log likelihood is defined as

$L(\theta|X) = \log l(\theta|X) = \sum_{t=1}^N \log p(x^t|\theta)$

Let us now see some distributions that arise in the applications we are interested in. If we have a two-class problem, the distribution we use is Bernoulli. When there are $K > 2$ classes, its generalization is the multinomial. The Gaussian (normal) density is the one most frequently used for modeling class-conditional input densities with numeric inputs. For the Bernoulli distribution, we discuss the maximum likelihood estimator (MLE) of its parameter.

Bernoulli Density
In a Bernoulli distribution, there are two outcomes: an event occurs or it does not; for example, an instance is a positive example of the class, or it is not. The event occurs and the Bernoulli random variable $X$ takes the value 1 with probability $p$; the nonoccurrence of the event has probability $1 - p$, denoted by $X$ taking the value 0. This is written as

$P(x) = p^x(1 − p)^{1−x}, x \in \{0, 1\}$

The expected value and variance can be calculated as

$E[X]=\sum_x x\,p(x)=1\cdot p+0\cdot(1-p)=p$
$Var[X]=\sum_x (x - E[X])^2 p(x)=p(1-p)$
$p$ is the only parameter and given an iid sample $X = \{x^t\}_{t=1}^N$, where $x^t \in \{0, 1\}$, we want to calculate its estimator, $\hat{p}$. The log likelihood is

$L(p|X)=log \prod_{t=1}^N p^{(x^t)}(1-p)^{(1-x^t)}$
$=\sum_t x^t log(p) + \left( N- \sum_t x^t\right) log(1-p)$
The $\hat{p}$ that maximizes the log likelihood can be found by solving $dL/dp = 0$. The hat (circumflex) denotes that it is an estimate.

$\hat{p}=\frac{\sum_t x^t}{N}$
The estimate for $p$ is the ratio of the number of occurrences of the event to the number of experiments. Recalling that if $X$ is Bernoulli with parameter $p$, then $E[X] = p$; as expected, the maximum likelihood estimator of the mean is the sample average.

Note that the estimate is a function of the sample and is itself a random variable; we can talk about the distribution of $\hat{p}_i$ given different samples $X_i$ drawn from the same $p(x)$. For example, the variance of the distribution of $\hat{p}_i$ is expected to decrease as $N$ increases; as the samples get bigger, they (and hence their averages) get more similar.

Example

Suppose we have the following set of observations from a Bernoulli process:

$x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]$

We want to estimate the parameter $p$, the probability of success.

Steps

  1. Likelihood Function: For a Bernoulli distribution, the likelihood function given our data is:

    $L(p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}$

    where $n$ is the number of observations.

  2. Log-Likelihood Function: Taking the logarithm of the likelihood function, we get:

    $\ell(p) = \log L(p) = \sum_{i=1}^{n} \left( x_i \log p + (1 - x_i) \log (1 - p) \right)$
  3. Maximize the Log-Likelihood: To find the MLE for $p$, we take the derivative of $\ell(p)$ with respect to $p$ and set it to zero:

    $\frac{\partial \ell(p)}{\partial p} = \sum_{i=1}^{n} \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) = 0$
  4. Solve for $p$:

    $\sum_{i=1}^{n} \frac{x_i}{p} = \sum_{i=1}^{n} \frac{1 - x_i}{1 - p}$
    $(1 - p) \sum_{i=1}^{n} x_i = p \sum_{i=1}^{n} (1 - x_i)$

    Let $k = \sum_{i=1}^{n} x_i$ be the total number of successes:

    $(1 - p)k = p(n - k)$
    $k - kp = np - kp$
    $k = np$
    $\hat{p} = \frac{k}{n}$
  5. Calculate the Estimate:

    • Number of observations ($n$) = 10
    • Number of successes ($k$) = 6 (since $x$ contains six 1's)

Therefore, the MLE for $p$ is:

$\hat{p} = \frac{k}{n} = \frac{6}{10} = 0.6$

So, the MLE estimate for the probability of success $p$ in the Bernoulli distribution, given our data, is $\hat{p} = 0.6$.
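The closed-form result can be checked numerically. Below is a minimal Python sketch (not part of the original example) that computes $\hat{p} = k/n$ for this data and confirms it against a grid search over the log-likelihood:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

def log_likelihood(p, x):
    """Bernoulli log-likelihood: sum_i x_i log p + (1 - x_i) log(1 - p)."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

p_hat = x.mean()                     # closed-form MLE: k / n
grid = np.linspace(0.01, 0.99, 99)   # numerical check over a grid of p values
p_grid = grid[np.argmax([log_likelihood(p, x) for p in grid])]

print(p_hat)   # 0.6
print(p_grid)  # 0.6, the grid maximiser agrees with the closed form
```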

Example: Probability that Liverpool FC wins a match in the next season

In the 2018-19 season, Liverpool FC won 30 out of 38 matches in the Premier League. Having this data, we'd like to make a guess at the probability that Liverpool FC wins a match in the next season.

The simplest guess here would be $30/38 \approx 79\%$, the best possible guess based on the data alone. This is actually an estimate made with the MLE method.

Then, assume we know that Liverpool's winning percentages for the past few seasons were around 50%. Do you think our best guess is still 79%? Some value between 50% and 79% seems more realistic, considering the prior knowledge as well as the data from this season. This is an estimate made with the MAP method.

In this example, we simplify by assuming that Liverpool has a single winning probability (call it $\theta$) throughout all matches across seasons, regardless of the uniqueness of each match and the complex factors of real football matches. In other words, we treat each of Liverpool's matches as a Bernoulli trial with winning probability $\theta$.

With this assumption, we can describe the probability that Liverpool wins $k$ times out of $n$ matches for any given numbers $k$ and $n$ ($k \le n$). More precisely, we assume that the number of wins of Liverpool follows a binomial distribution with parameter $\theta$. The probability that Liverpool wins $k$ times out of $n$ matches, given the winning probability $\theta$, is:

$P(D|\theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$

This simplification (describing the probability using just a single parameter $\theta$ regardless of real-world complexity) is the statistical modelling of this example, and $\theta$ is the parameter to be estimated.

Maximum Likelihood Estimation

In the previous section, we got the formula of probability that Liverpool wins $k$ times out of $n$ matches for given $\theta$.
Since we have the observed data from this season, which is 30 wins out of 38 matches (let’s call this data as $D$), we can calculate $P(D|θ)$ — the probability that this data $D$ is observed for given $\theta$. Let’s calculate $P(D|\theta)$ for $\theta=0.1$ and $\theta=0.7$ as examples.

When Liverpool's winning probability is $\theta = 0.1$, the probability that this data $D$ (30 wins in 38 matches) is observed is:

$P(D|\theta=0.1) = \binom{38}{30}(0.1)^{30}(0.9)^{8} \approx 2.11 \times 10^{-23}$

So, if Liverpool's winning probability $\theta$ is actually 0.1, this data $D$ (30 wins in 38 matches) is extremely unlikely to be observed. Then what if $\theta = 0.7$?


$P(D|\theta=0.7) = \binom{38}{30}(0.7)^{30}(0.3)^{8} \approx 0.072$, much higher than the previous one. So if Liverpool's winning probability $\theta$ is 0.7, this data $D$ is much more likely to be observed than when $\theta = 0.1$.

Based on this comparison, we would be able to say that $\theta$ is more likely to be 0.7 than 0.1 considering the actual observed data $D$.
Here, we've been calculating the probability that $D$ is observed for each $\theta$, but at the same time, we can also say that we've been checking the likelihood of each value of $\theta$ based on the observed data. Because of this, $P(D|\theta)$ is also considered the likelihood of $\theta$. The next question is: what is the exact value of $\theta$ that maximises the likelihood $P(D|\theta)$? Yes, this is Maximum Likelihood Estimation!

The value of $\theta$ maximising the likelihood can be obtained by taking the derivative of the likelihood function with respect to $\theta$ and setting it to zero:

$\frac{d}{d\theta}\binom{n}{k}\theta^k(1-\theta)^{n-k} = \binom{n}{k}\theta^{k-1}(1-\theta)^{n-k-1}(k - n\theta) = 0$

By solving this, $\theta = 0$, $1$, or $k/n$. Since the likelihood goes to zero when $\theta = 0$ or $1$, the value of $\theta$ maximising the likelihood is $k/n$.

In this example, the estimated value of $\theta$ is $30/38 \approx 78.9\%$ when estimated with MLE.
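The likelihood comparison for different values of $\theta$ can be reproduced with a few lines of Python. This is an illustrative sketch of the binomial likelihood for 30 wins in 38 matches, not code from the original text:

```python
from math import comb

n, k = 38, 30  # 38 matches, 30 wins

def likelihood(theta):
    """Binomial likelihood P(D | theta) of observing k wins in n matches."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(likelihood(0.1))      # ~2.1e-23: the data is essentially impossible if theta = 0.1
print(likelihood(0.7))      # ~0.072: far more plausible
print(likelihood(30 / 38))  # the MLE theta = k/n gives the largest likelihood of all
```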

Maximum Likelihood Estimation of Normal Distribution

To derive the Maximum Likelihood Estimators (MLEs) for the mean ($\mu$) and variance ($\sigma^2$) of a normal distribution, we need to maximize the log-likelihood function. Here's a step-by-step process:

Given Data and Model

Let's assume we have a set of $n$ independent and identically distributed observations $x_1, x_2, \ldots, x_n$ from a normal distribution $\mathcal{N}(\mu, \sigma^2)$.

The probability density function (pdf) of a normal distribution is:

$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Likelihood Function

The likelihood function for the given data is:

$L(\mu, \sigma^2) = \prod_{i=1}^{n} f(x_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$

Log-Likelihood Function

Taking the logarithm of the likelihood function, we get the log-likelihood function:

$\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right)$
$\ell(\mu, \sigma^2) = \sum_{i=1}^{n} \left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2}\right)$
$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$

Derivative with respect to $\mu$

To find the MLE for $\mu$, we take the derivative of the log-likelihood function with respect to $\mu$ and set it to zero:

$\frac{\partial \ell(\mu, \sigma^2)}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{\partial}{\partial \mu}(x_i-\mu)^2$
$\frac{\partial \ell(\mu, \sigma^2)}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i-\mu)(-1)$
$\frac{\partial \ell(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu)$

Setting the derivative to zero, we get:

$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0$
$\sum_{i=1}^{n}(x_i-\mu) = 0$
$\sum_{i=1}^{n} x_i - n\mu = 0$
$n\mu = \sum_{i=1}^{n} x_i$
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

So, the MLE for $\mu$ is:

$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Derivative with respect to $\sigma^2$

To find the MLE for $\sigma^2$, we take the derivative of the log-likelihood function with respect to $\sigma^2$ and set it to zero:

$\frac{\partial \ell(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2$

Setting the derivative to zero, we get:

$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$
$-\frac{n}{\sigma^2} + \frac{1}{\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$
$-\frac{n}{\sigma^2} + \frac{S}{\sigma^4} = 0$

where $S = \sum_{i=1}^{n}(x_i-\mu)^2$.

Multiplying through by $\sigma^4$, we get:

$-n\sigma^2 + S = 0$
$S = n\sigma^2$
$\sigma^2 = \frac{S}{n} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$

So, the MLE for $\sigma^2$ is:

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$

In summary:
  • The MLE for the mean $\mu$ is:

    $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • The MLE for the variance $\sigma^2$ is:

    $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$
Example

Let's consider a simple example where we assume our data follows a normal (Gaussian) distribution. We want to estimate the mean $\mu$ and variance $\sigma^2$ of this distribution.

Step-by-Step Example

  1. Assume the Data: Suppose we have the following set of observations:

    $x=[2.3,1.9,3.1,2.8,3.0]$
  2. Model Assumption: We assume the data follows a normal distribution with unknown mean $\mu$ and variance $\sigma^2$.

  3. Likelihood Function: The probability density function of the normal distribution is:

    $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

    Therefore, the likelihood function for our data is:

    $L(\mu, \sigma^2) = \prod_{i=1}^{5} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$
  4. Log-Likelihood Function: Taking the logarithm of the likelihood function, we get:

    $\ell(\mu, \sigma^2) = \sum_{i=1}^{5} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right)$

    Simplifying, this becomes:

    $\ell(\mu, \sigma^2) = -\frac{5}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{5}(x_i-\mu)^2$
  5. Maximize the Log-Likelihood:

    • To find the estimates for $\mu$ and $\sigma^2$, we take partial derivatives of $\ell(\mu, \sigma^2)$ with respect to $\mu$ and $\sigma^2$, and set them to zero.
    • Solving these equations gives:
      $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{5} x_i$
      $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{5}(x_i-\hat{\mu})^2$
  6. Calculate Estimates:

    • Mean estimate:
      $\hat{\mu} = \frac{2.3 + 1.9 + 3.1 + 2.8 + 3.0}{5} = 2.62$
    • Variance estimate:
      $\hat{\sigma}^2 = \frac{(2.3-2.62)^2 + (1.9-2.62)^2 + (3.1-2.62)^2 + (2.8-2.62)^2 + (3.0-2.62)^2}{5} = 0.2056$

So, the MLE estimates for the mean and variance of the normal distribution given our data are $\hat{\mu} = 2.62$ and $\hat{\sigma}^2 = 0.2056$.
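These two estimates are easy to verify in code. A minimal sketch; note that NumPy's `np.var` divides by $n$ by default (`ddof=0`), which is exactly the MLE of the variance:

```python
import numpy as np

x = np.array([2.3, 1.9, 3.1, 2.8, 3.0])

mu_hat = x.mean()                      # MLE of the mean: the sample average
sigma2_hat = np.mean((x - mu_hat)**2)  # MLE of the variance: divide by n, not n - 1

print(mu_hat)      # 2.62
print(sigma2_hat)  # 0.2056, equal to np.var(x) since its default ddof=0 is the MLE
```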

Example (university question)
Suppose the weights of randomly selected female students at a school are normally distributed with unknown mean and standard deviation. A random sample of 10 female students yielded the following weights (in pounds): 115, 122, 130, 127, 149, 160, 152, 138, 149, 180. Find the maximum likelihood estimates of the mean weight and variance.

Calculate the Mean ($\mu$):

$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Here, $n = 10$ (the number of data points), and the data points $x_i$ are given.

$\hat{\mu} = \frac{1}{10}(115 + 122 + 130 + 127 + 149 + 160 + 152 + 138 + 149 + 180)$

First, sum the data points:

$115 + 122 + 130 + 127 + 149 + 160 + 152 + 138 + 149 + 180 = 1422$


Now, calculate the mean:

$\hat{\mu}=\frac{1422}{10}=142.2$

Calculate the Variance ($\sigma^2$):

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$

Now, calculate the variance:

σ^2=3479.610=347.96\hat{\sigma}^2 = \frac{3479.6}{10} = 347.96
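As a quick check of this arithmetic, the same estimates can be computed with NumPy (a sketch, not part of the original question):

```python
import numpy as np

weights = np.array([115, 122, 130, 127, 149, 160, 152, 138, 149, 180], dtype=float)

mu_hat = weights.mean()       # 1422 / 10 = 142.2
sigma2_hat = np.var(weights)  # default ddof=0 divides by n, i.e. the MLE

print(mu_hat)      # 142.2
print(sigma2_hat)  # 347.96
```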

Maximum A Posteriori Estimation (MAP)
Maximum A Posteriori (MAP) estimation is a method in Bayesian statistics used to estimate the parameters of a statistical model. It combines the likelihood of the observed data with prior information about the parameters to find the most probable parameter value.

MLE is powerful when you have enough data. However, it doesn't work well when the observed data size is small. For example, if Liverpool had played only 2 matches and won both, then the value of $\theta$ estimated by MLE would be $2/2 = 1$: the estimation says Liverpool wins 100% of matches, which is unrealistic. MAP can help deal with this issue.

Assume that we have prior knowledge that Liverpool's winning percentage for the past few seasons was around 50%. Then, even without the data from this season, we already have some idea of the potential value of $\theta$. Based (only) on the prior knowledge, $\theta$ is most likely to be 0.5, and less likely to be 0 or 1. In other words, the probability of $\theta = 0.5$ is higher than that of $\theta = 0$ or $1$. We call this the prior probability $P(\theta)$.

Then, having observed the data $D$ (30 wins out of 38 matches) from this season, we can update $P(\theta)$, which is based only on the prior knowledge. The updated probability of $\theta$ given $D$ is expressed as $P(\theta|D)$ and is called the posterior probability.

Now, we want to know the best guess of $\theta$ considering both our prior knowledge and the observed data. This means maximising $P(\theta|D)$, and that is the MAP estimation.
The question here is how to calculate $P(\theta|D)$. We have seen how to calculate $P(D|\theta)$, but not yet $P(\theta|D)$. To do so, we need to use Bayes' theorem:

$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$

With this theorem, we can calculate the posterior probability $P(\theta|D)$ using the likelihood $P(D|\theta)$ and the prior probability $P(\theta)$.

There's $P(D)$ in the equation, but $P(D)$ is independent of the value of $\theta$. Since we're only interested in finding the $\theta$ maximising $P(\theta|D)$, we can ignore $P(D)$ in our maximisation:

$\underset{\theta}{\operatorname{argmax}}\, P(\theta|D) = \underset{\theta}{\operatorname{argmax}}\, P(D|\theta)\,P(\theta)$

This means that the maximisation of the posterior probability $P(\theta|D)$ with respect to $\theta$ is equal to the maximisation of the product of the likelihood $P(D|\theta)$ and the prior probability $P(\theta)$ with respect to $\theta$.


In principle, we can use any probability distribution as $P(\theta)$ to express our prior knowledge. However, for computational simplicity, a specific probability distribution is used that corresponds to the probability distribution of the likelihood. It is called the conjugate prior distribution.

In this example, the likelihood $P(D|\theta)$ follows a binomial distribution. Since the conjugate prior of the binomial distribution is the Beta distribution, we use the Beta distribution to express $P(\theta)$:

$P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$

where $B(\alpha, \beta)$ is the Beta function, a normalising constant. Here, $\alpha$ and $\beta$ are called hyperparameters, which cannot be determined from the data; rather, we set them subjectively to express our prior knowledge well. For example, the graphs below are visualisations of the Beta distribution with different values of $\alpha$ and $\beta$. The top left graph is the one used in the example above (expressing that $\theta = 0.5$ is the most likely value based on the prior knowledge), and the top right graph expresses the same prior knowledge, but for a believer that past seasons' results reflect Liverpool's true capability very well.
Figure 2. Visualizations of Beta distribution with different values of α and β

A note about the bottom right graph: when $\alpha = 1$ and $\beta = 1$, the prior is uniform, meaning we don't have any prior knowledge about $\theta$; in this case the estimation is exactly the same as the one by MLE. So, by now we have all the components needed to calculate the $P(D|\theta)P(\theta)$ to maximise.

As with MLE, we can get the $\theta$ maximising this by taking the derivative of this function with respect to $\theta$ and setting it to zero. Since

$P(D|\theta)P(\theta) \propto \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}$

setting the derivative to zero and solving, we obtain the following:

$\theta = \frac{k+\alpha-1}{n+\alpha+\beta-2}$

In this example, assuming we use $\alpha = 10$ and $\beta = 10$, then $\theta = (30+10-1)/(38+10+10-2) = 39/56 \approx 69.6\%$.
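The MAP result is easy to sanity-check in code. A minimal sketch using the Beta-prior mode formula, with $\alpha = \beta = 10$ as in the text, and a uniform prior ($\alpha = \beta = 1$) to confirm that MAP then reduces to the MLE:

```python
def map_estimate(k, n, alpha, beta):
    """Mode of the Beta(alpha + k, beta + n - k) posterior:
    (k + alpha - 1) / (n + alpha + beta - 2)."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

print(map_estimate(30, 38, 10, 10))  # 39/56 ≈ 0.696, pulled toward the 0.5 prior
print(map_estimate(30, 38, 1, 1))    # uniform prior: 30/38 ≈ 0.789, identical to MLE
```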

Example:

Let's consider an example using a Bernoulli distribution with a known prior.

Given Data and Model

Suppose we have a set of observations from a Bernoulli process: $x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]$

We want to estimate the parameter $p$, the probability of success. Assume we have a Beta prior for $p$:

$p \sim \text{Beta}(\alpha, \beta)$

where $\alpha$ and $\beta$ are the parameters of the Beta distribution.

Step-by-Step Calculation

  1. Likelihood: The likelihood of the data given $p$ is:

    $P(x|p) = p^k (1 - p)^{n - k}$

    where $k$ is the number of successes (the sum of the data) and $n$ is the total number of observations.

  2. Prior: The prior distribution $P(p)$ is given by the Beta distribution:

    $P(p) \propto p^{\alpha - 1} (1 - p)^{\beta - 1}$
  3. Posterior: Combining the likelihood and the prior using Bayes' theorem:

    $P(p|x) \propto P(x|p)\,P(p)$
    $P(p|x) \propto p^k (1 - p)^{n - k} \cdot p^{\alpha - 1} (1 - p)^{\beta - 1}$
    $P(p|x) \propto p^{k + \alpha - 1} (1 - p)^{n - k + \beta - 1}$

    The posterior distribution is also a Beta distribution with updated parameters $\alpha + k$ and $\beta + n - k$:

    $p|x \sim \text{Beta}(\alpha + k, \beta + n - k)$
  4. MAP Estimate: The mode of the Beta distribution (which is the MAP estimate) is given by:

    $\hat{p}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}$

    This formula is valid for $\alpha, \beta > 1$.

Example with Specific Values

Let's assume $\alpha = 2$ and $\beta = 2$ for the prior. Given the data $x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]$:

  • $n = 10$ (total number of observations)
  • $k = 6$ (number of successes)

The posterior distribution parameters are:

$\alpha' = \alpha + k = 2 + 6 = 8$
$\beta' = \beta + n - k = 2 + 10 - 6 = 6$

So the MAP estimate for $p$ is:

$\hat{p}_{\text{MAP}} = \frac{8 - 1}{8 + 6 - 2} = \frac{7}{12} \approx 0.583$
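The closed-form mode can be cross-checked by maximising the unnormalised log-posterior numerically. A sketch under the same numbers ($\alpha = \beta = 2$, $n = 10$, $k = 6$):

```python
import numpy as np

alpha, beta, n, k = 2, 2, 10, 6

# closed-form MAP estimate: mode of the Beta(alpha + k, beta + n - k) posterior
p_map = (alpha + k - 1) / (alpha + beta + n - 2)

# numerical check: maximise the unnormalised log-posterior over a grid of p values
grid = np.linspace(0.001, 0.999, 999)
log_post = (k + alpha - 1) * np.log(grid) + (n - k + beta - 1) * np.log(1 - grid)
p_grid = grid[np.argmax(log_post)]

print(p_map)   # 7/12 ≈ 0.583
print(p_grid)  # ≈ 0.583, the grid maximiser agrees with the closed form
```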


********************************************************************************
More Examples
Example:
To illustrate Bayes' rule, consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: + (positive) and - (negative). We have prior knowledge that over the entire population of people only 0.008 have this disease.
Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result. The above situation can be summarized by the following probabilities:

$P(cancer)=0.008$  $P(\neg cancer)=0.992$
$P(+|cancer)=0.98$  $P(-|cancer)=0.02$
$P(+|\neg cancer)=0.03$  $P(-|\neg cancer)=0.97$

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori hypothesis can be found using Equation

$P(h|D) \propto P(D|h) \times P(h)$
$P(cancer|+) \propto P(+|cancer) \times P(cancer) = 0.98 \times 0.008 = 0.0078$
$P(\neg cancer|+) \propto P(+|\neg cancer) \times P(\neg cancer) = 0.03 \times 0.992 = 0.0298$

Thus, $h_{MAP}= \neg cancer$
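The comparison can be written out directly in code; normalising the two products also gives the actual posterior probability of cancer, which the worked numbers leave implicit. A sketch:

```python
# prior and test characteristics from the example
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# unnormalised posteriors P(+|h) * P(h) for each hypothesis
score_cancer = p_pos_given_cancer * p_cancer            # 0.98 * 0.008 = 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer   # 0.03 * 0.992 = 0.02976

print("MAP hypothesis:", "cancer" if score_cancer > score_no_cancer else "not cancer")

# normalising gives the posterior probability of cancer given a positive test
print(score_cancer / (score_cancer + score_no_cancer))  # ≈ 0.21
```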

Example:
Suppose that $X$ is a discrete random variable with the following probability mass function, where $0 \le \theta \le 1$ is a parameter:

$X:\quad\quad\; 0 \quad\quad 1 \quad\quad 2 \quad\quad\quad 3$
$P(X):\quad \frac{2\theta}{3} \quad\; \frac{\theta}{3} \quad \frac{2(1-\theta)}{3} \quad \frac{1-\theta}{3}$

The following 10 independent observations were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of $\theta$?

Since the sample is (3,0,2,1,3,2,1,0,2,1), the likelihood is
$L(\theta) = P(X = 3)P(X = 0)P(X = 2)P(X = 1)P(X = 3)P(X = 2)P(X = 1)P(X = 0)P(X = 2)P(X = 1)$
Substituting from the probability distribution given above, we have
$L(\theta)=\prod P(X_i|\theta)=\left(\frac{2\theta}{3}\right)^2\left(\frac{\theta}{3}\right)^3\left(\frac{2(1-\theta)}{3}\right)^3\left(\frac{(1-\theta)}{3}\right)^2$

Clearly, the likelihood function $L(\theta)$ is not easy to maximize.
Let us look at the log likelihood function

$l(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(X_i|\theta)$
$\quad\quad = 2\left(\log\frac{2}{3} + \log\theta\right) + 3\left(\log\frac{1}{3} + \log\theta\right) + 3\left(\log\frac{2}{3} + \log(1-\theta)\right) + 2\left(\log\frac{1}{3} + \log(1-\theta)\right)$
$\quad\quad = C + 5\log\theta + 5\log(1-\theta)$
where C is a constant which does not depend on $\theta$. It can be seen that the log likelihood
function is easier to maximize compared to the likelihood function.
Let the derivative of $l(\theta)$ with respect to $\theta$ be zero:

$\frac{\mathrm{d} l(\theta)}{\mathrm{d} \theta}=\frac{5}{\theta}-\frac{5}{1-\theta}=0$
and the solution gives us the MLE, which is $\theta = 0.5$.
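Since the pmf is discrete, the MLE can also be confirmed by evaluating the log likelihood over a grid of $\theta$ values. A minimal sketch of that check, using the pmf table from the question:

```python
import numpy as np

sample = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def log_likelihood(theta):
    """Sum of log pmf values over the sample, using the P(X) table above."""
    pmf = {0: 2 * theta / 3, 1: theta / 3, 2: 2 * (1 - theta) / 3, 3: (1 - theta) / 3}
    return sum(np.log(pmf[x]) for x in sample)

grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_hat)  # 0.5, matching the closed-form solution
```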

Example.
A coin is flipped 100 times. Given that there are 55 heads, find the maximum likelihood estimate for the probability $p$ of heads on a single toss.

We can think of counting the number of heads in 100 tosses as an experiment. For a given value of $p$, the probability of getting 55 heads in this experiment is the binomial probability
$P(55 \text{ heads}) = \binom{100}{55} p^{55} (1-p)^{45}$

The probability of getting 55 heads depends on the value of $p$, so let's include this as a conditional probability:
$P(55 \text{ heads} \mid p) = \binom{100}{55} p^{55} (1-p)^{45}$

Experiment: Flip the coin 100 times and count the number of heads.
Data: The data is the result of the experiment. In this case it is '55 heads'.
Parameter(s) of interest: We are interested in the value of the unknown parameter $p$.
Likelihood, or likelihood function: this is $P(\text{data}|p)$. Note that it is a function of both the data and the parameter $p$.


In this case the likelihood is
$P(55 \text{ heads}|p)=\binom{100}{55}p^{55}(1-p)^{45}$
Note: the likelihood $P(\text{data}|p)$ changes as the parameter of interest $p$ changes.

Definition: Given data, the maximum likelihood estimate (MLE) for the parameter $p$ is the value of $p$ that maximizes the likelihood $P(\text{data}|p)$. That is, the MLE is the value of $p$ for which the data is most likely.

In order to find the maximum value we take the derivative of the likelihood function, set it to zero, and solve for $p$.
$\frac{\mathrm{d} }{\mathrm{d} p} P(\text{data}|p)=\binom{100}{55}\left(55p^{54}(1-p)^{45}-45p^{55}(1-p)^{44}\right)=0$
Solving this for $p$, we get
$55p^{54}(1-p)^{45}=45p^{55}(1-p)^{44}$
Dividing both sides by $p^{54}(1-p)^{44}$:
$55(1-p)=45p$
$55=100p$
$p=55/100=0.55$

So the MLE is $\hat{p} = 0.55$.
Notes:
1. The MLE for $p$ turned out to be exactly the fraction of heads we saw in our data.
2. The MLE is computed from the data. That is, it is a statistic.
3. Strictly speaking, you should check that the critical point is indeed a maximum. You can do this with the second derivative test.
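The same answer can be recovered numerically. Here is a minimal sketch that maximizes the binomial log likelihood with SciPy's bounded scalar minimizer (minimizing the negative log likelihood, as optimizers conventionally minimize):

```python
from scipy.stats import binom
from scipy.optimize import minimize_scalar

# Negative binomial log likelihood for 55 heads in 100 tosses
def neg_log_likelihood(p):
    return -binom.logpmf(55, 100, p)

# Bounded one-dimensional minimization over 0 < p < 1
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method='bounded')
p_hat = res.x
print(p_hat)  # close to 55/100 = 0.55
```

The optimizer converges to the analytic MLE $\hat{p} = 0.55$.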

Example. Suppose that the lifetime of bulbs is modeled by an exponential distribution with (unknown) parameter $\lambda$. We test 5 bulbs and find they have lifetimes of 2, 3, 1, 3, and 4 years, respectively. What is the MLE for $\lambda$?


The Exponential Distribution: A continuous random variable $X$ is said to have an Exponential($\lambda$) distribution if it has probability density function $f_X(x|\lambda)=\lambda e^{-\lambda x}$ for $x>0$ and $0$ for $x\le 0$, where $\lambda>0$ is called the rate of the distribution.

Let $X_i$ be the lifetime of the $i^{th}$ bulb and let $x_i$ be the value $X_i$ takes; then each $X_i$ has pdf $f_{X_i}(x_i)=\lambda e^{- \lambda x_i}$. We assume that the lifetimes of the bulbs are independent, so the joint pdf is the product of the individual densities:

$f(x_1,x_2,x_3,x_4,x_5|\lambda)=(\lambda e^{-\lambda x_1})(\lambda e^{-\lambda x_2})(\lambda e^{-\lambda x_3})(\lambda e^{-\lambda x_4})(\lambda e^{-\lambda x_5})$

$f(x_1,x_2,x_3,x_4,x_5|\lambda)=\lambda^5 e^{-\lambda(x_1+x_2+x_3+x_4+x_5)}$
$f(2,3,1,3,4|\lambda)=\lambda^5 e^{-\lambda(2+3+1+3+4)}$
$f(2,3,1,3,4|\lambda)=\lambda^5 e^{-13\lambda}$
$\ln f(2,3,1,3,4|\lambda)=\ln(\lambda^5)+\ln(e^{-13\lambda})=5\ln(\lambda)-13\lambda$
$\frac{\mathrm{d} }{\mathrm{d} \lambda}\ln f(2,3,1,3,4|\lambda)=\frac{5}{\lambda}-13=0$
$\hat{\lambda}=\frac{5}{13}$
Note that the MLE $\hat{\lambda}=\frac{5}{13}$ is the reciprocal of the sample mean $\bar{x}=\frac{13}{5}$.
In general, the MLE of $\lambda$ is
$\hat{\lambda}=\frac{n}{x_1+x_2+\cdots+x_n}=\frac{1}{\bar{x}}$
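This can be verified on the bulb data. The sketch below computes the closed-form MLE and cross-checks it against a grid search over the log likelihood $l(\lambda)=n\ln\lambda-\lambda\sum x_i$ (dropping terms that do not depend on $\lambda$):

```python
import numpy as np

# Bulb lifetimes from the example
lifetimes = np.array([2.0, 3.0, 1.0, 3.0, 4.0])

# Closed-form MLE: lambda_hat = n / sum(x_i), the reciprocal of the sample mean
lambda_hat = len(lifetimes) / lifetimes.sum()  # 5/13

# Numerical check: l(lam) = n*log(lam) - lam*sum(x_i), up to a constant
grid = np.linspace(0.01, 2.0, 2000)
loglik = len(lifetimes) * np.log(grid) - grid * lifetimes.sum()
lambda_grid = grid[np.argmax(loglik)]
print(lambda_hat)  # 5/13, about 0.3846
```

The grid maximizer agrees with $\hat{\lambda}=5/13$ to the resolution of the grid.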

Example
Normal distribution
Suppose the data $x_1,x_2,\ldots,x_n$ is drawn from a $N(\mu,\sigma)$ distribution, where $\mu$ and $\sigma$ are unknown. Find the maximum likelihood estimate for the pair $(\mu,\sigma)$.

Let uppercase $X_1,X_2,\ldots,X_n$ be i.i.d. $N(\mu,\sigma)$ random variables, and let lowercase $x_i$ be the value $X_i$ takes. The density for each $X_i$ is

$f_{X_i}(x_i)=\frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x_i-\mu)^2}{2\sigma^2}}$

Since the $X_i$ are independent their joint pdf is the product of the individual pdf's:
$f(x_1,x_2,\ldots,x_n|\mu,\sigma)=\left(\frac{1}{\sqrt{2\pi}\sigma}\right )^ne^{-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}}$
The log likelihood is
$ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))=-nln(\sqrt{2\pi})-nln(\sigma)-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}$

Since $\ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))$ is a function of the two variables $\mu$ and $\sigma$, we can use partial derivatives to find the MLE.

$\frac{\partial }{\partial\mu}ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))=\sum_{i=1}^n\frac{(x_i-\mu)}{\sigma^2}=0$
$\sum_{i=1}^n x_i=n\mu$
$\hat\mu=\frac{\sum_{i=1}^n x_i}{n}=\bar{x}$

To find $\hat{\sigma}$ we can differentiate and solve for $\sigma$

$\frac{\partial }{\partial\sigma}ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))=-\frac{n}{\sigma}+\sum_{i=1}^n\frac{(x_i-\mu)^2}{\sigma^3}=0$

$\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2$

We already know that $\hat{\mu}=\bar{x}$, so we can substitute that value.
So the MLEs are

$\hat{\mu}=\bar{x}= $ mean of the data
$\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = $ variance of the data
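These closed-form estimates are one line each in NumPy. A small sketch (the sample values are hypothetical, chosen only for illustration); note that `np.var` uses `ddof=0` by default, which is exactly the biased MLE variance derived above:

```python
import numpy as np

# Hypothetical sample (illustrative values, not from the text)
x = np.array([2.1, 1.7, 2.5, 1.9, 2.3, 2.0])

mu_hat = x.mean()                                    # MLE of mu: the sample mean
sigma2_hat = np.var(x)                               # ddof=0 by default, i.e. the MLE
sigma2_manual = np.sum((x - mu_hat) ** 2) / len(x)   # the formula derived above
```

Dividing by $n$ rather than $n-1$ is what makes this the MLE; `np.var(x, ddof=1)` would give the unbiased sample variance instead.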

Example
MLE of the Poisson distribution

The Poisson distribution is a probability distribution used to show how many times an event is likely to occur over a specified period. In other words, it is a count distribution. Poisson distributions are often used to model independent events that occur at a constant rate within a given interval of time. It was named after the French mathematician Siméon Denis Poisson.


The Poisson distribution is a discrete distribution, meaning that the variable can only take specific values in a (potentially infinite) list. Put differently, the variable cannot take all values in a continuous range.

The probability mass function is
$f(x)=\frac{\lambda^x}{x!}e^{-\lambda}$
where $x$ is the number of occurrences and $\lambda$ is the expected value of $x$, which is also equal to its variance.

The likelihood function is the product of the PMF of the observed values $x_1,x_2,\ldots,x_n$
$L(x_1,x_2,\ldots,x_n|\lambda)=\prod_{j=1}^n\frac{\lambda^{x_j}e^{-\lambda}}{x_j!}$

The log likelihood function is
$\ln(L(x_1,x_2,\ldots,x_n|\lambda))=\ln\left(\prod_{j=1}^n\frac{\lambda^{x_j}e^{-\lambda}}{x_j!}\right)$
$=\sum_{j=1}^n\left[\ln(\lambda^{x_j})+\ln(e^{-\lambda})-\ln(x_j!)\right]$
$=\sum_{j=1}^n\left[x_j \ln(\lambda)-\lambda-\ln(x_j!)\right]$
$=-n\lambda+\ln(\lambda)\sum_{j=1}^n x_j-\sum_{j=1}^n \ln(x_j!)$

Next, we calculate the derivative of the log likelihood function with respect to the parameter $\lambda$:

$\frac{\mathrm{d} }{\mathrm{d} \lambda}ln(L(x_1,x_2,\ldots,x_n|\lambda))$
$=-n+\frac{1}{\lambda}\sum_{j=1}^n x_j$

Set the derivative equal to zero and solve for $\lambda$
$-n+\frac{1}{\lambda}\sum_{j=1}^n x_j=0$
$\lambda=\frac{1}{n}\sum_{j=1}^n x_j$

Thus the MLE turns out to be
$\hat{\lambda}=\frac{1}{n}\sum_{j=1}^n x_j$

which is the sample mean of the $n$ observations.
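This can be confirmed numerically as well. The sketch below (with hypothetical count data, used only for illustration) minimizes the negative Poisson log likelihood and compares the result to the sample mean:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

# Hypothetical event counts (illustrative, not from the text)
counts = np.array([2, 4, 3, 1, 5, 2, 3])

# Closed-form MLE: the sample mean
lam_closed = counts.mean()

# Numerical check: minimize the negative Poisson log likelihood
def neg_log_likelihood(lam):
    return -np.sum(poisson.logpmf(counts, lam))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20.0), method='bounded')
lam_hat = res.x
```

The numerical optimum matches the sample mean, as the derivation predicts.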

Example Python Code
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate some sample data from a normal distribution
np.random.seed(42)
true_mean = 2.0
true_std = 1.5
sample_size = 100
sample_data = np.random.normal(true_mean, true_std, sample_size)

# Define the likelihood function for a normal distribution
# (a product of individual densities; this can underflow for large samples)
def likelihood(data, mean, std):
    return np.prod(norm.pdf(data, loc=mean, scale=std))

# Define the negative log likelihood function (to be minimized).
# Summing log densities is numerically more stable than taking the
# log of the product, and we reject invalid standard deviations that
# Nelder-Mead may try along the way.
def negative_log_likelihood(params, data):
    mean, std = params
    if std <= 0:
        return np.inf
    return -np.sum(norm.logpdf(data, loc=mean, scale=std))

# Initial guess for mean and standard deviation
initial_guess = [0.0, 1.0]

# Use optimization to find the MLE estimates
from scipy.optimize import minimize
result = minimize(negative_log_likelihood, initial_guess, args=(sample_data,), method='Nelder-Mead')

# Extract the MLE estimates from the optimization result
estimated_mean, estimated_std = result.x

# Print the results
print("True Mean:", true_mean)
print("True Standard Deviation:", true_std)
print("Estimated Mean (MLE):", estimated_mean)
print("Estimated Standard Deviation (MLE):", estimated_std)

# Plot the histogram of the sample data and the true/estimated PDFs
plt.hist(sample_data, bins=20, density=True, alpha=0.6, label='Sample Data')
x_range = np.linspace(-5, 10, 400)
plt.plot(x_range, norm.pdf(x_range, loc=true_mean, scale=true_std), 'r', label='True PDF')
plt.plot(x_range, norm.pdf(x_range, loc=estimated_mean, scale=estimated_std), 'g', label='Estimated PDF (MLE)')
plt.legend()
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Maximum Likelihood Estimation Example')
plt.show()

True Mean: 2.0 
True Standard Deviation: 1.5 
Estimated Mean (MLE): 1.8442158278764509 
Estimated Standard Deviation (MLE): 1.35541802281766


