Density Estimation: MLE and MAP

Density Estimation

Density estimation is a statistical technique used to estimate the probability density function of a random variable from a set of observed data points. It helps in understanding the underlying distribution of the data.

Maximum Likelihood Estimation (MLE) is a method used to find the parameters of a statistical model that maximize the likelihood function, which measures how well the model explains the observed data. In the context of density estimation, MLE aims to find the parameters that make the observed data most probable.

Maximum A Posteriori Estimation (MAP), on the other hand, incorporates prior knowledge about the parameters by using a prior distribution. It combines this prior information with the likelihood function to obtain a posterior distribution. In the context of density estimation, MAP seeks to find the parameters that maximize the posterior probability given both the observed data and prior information.

In summary, the key difference lies in the incorporation of prior information in MAP, making it a Bayesian approach, while MLE purely maximizes the likelihood of the observed data without considering prior knowledge.

Typically, estimating the entire distribution is intractable; instead, we are content with a point estimate from the distribution, such as its mean or mode. Maximum a Posteriori, or MAP for short, is a Bayesian approach to estimating a distribution and model parameters that best explain an observed dataset.

For example, suppose we have a sample of observations ($X = x_1, x_2, x_3, \ldots, x_n$) from a domain, where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explains the joint probability distribution of the observed data ($X$).

Often estimating the density is too challenging; instead, we are happy with a point estimate from the target distribution, such as the mean.

There are many techniques for solving this problem, although two common approaches are:
Maximum A Posteriori (MAP), a Bayesian method.
Maximum Likelihood Estimation (MLE), a frequentist method.

MLE and MAP are methods of estimating the parameters of statistical models.

Both approaches frame the problem as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

$P(X | \theta)$
or
$P(x_1, x_2, x_3,\ldots, x_n | \theta)$

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters.

The objective of Maximum Likelihood Estimation is to find the set of parameters ($\theta$) that maximize the likelihood function, e.g. result in the largest likelihood value.

maximize $P(X | \theta)$

An alternative and closely related approach is to consider the optimization problem from the perspective of Bayesian probability.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

$P(A | B) = (P(B | A) * P(A)) / P(B)$

The quantity that we are calculating is typically referred to as the posterior probability of $A$ given $B$ and $P(A)$ is referred to as the prior probability of $A$.

The normalizing constant of $P(B)$ can be removed, and the posterior can be shown to be proportional to the probability of $B$ given $A$ multiplied by the prior.
$P(A | B)$ is proportional to $P(B | A) * P(A)$

Or, simply:
$P(A | B) \propto P(B | A) \times P(A)$
This is a helpful simplification, as we are not interested in estimating a probability but in optimizing a quantity; a proportional quantity is good enough for this purpose.
We can now relate this calculation to our desire to estimate a distribution and parameters ($\theta$) that best explains our dataset ($X$), as we described in the previous section. This can be stated as:

$P(\theta | X) \propto P(X | \theta) \times P(\theta)$

Maximizing this quantity over a range of $\theta$ solves an optimization problem for estimating the central tendency of the posterior probability (i.e., the mode of the distribution). As such, this technique is referred to as “maximum a posteriori estimation,” or MAP estimation for short, and sometimes simply “maximum posterior estimation.”

maximize $P(X | \theta) \times P(\theta)$

We are typically not calculating the full posterior probability distribution, and in fact, this may not be tractable for many problems of interest.
Finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem.

Instead, we are calculating a point estimate such as the mode, the most common value, which coincides with the mean for the normal distribution.
One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation.

Note: this is very similar to Maximum Likelihood Estimation, with the addition of the prior probability over the distribution and parameters.

In fact, if we assume that all values of $\theta$ are equally likely because we don’t have any prior information (e.g. a uniform prior), then both calculations are equivalent.
Because of this equivalence, both MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the calculation of the MLE and MAP optimization problem differ, the MLE and MAP solution found for an algorithm may also differ.
The maximum likelihood hypothesis might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

Maximum Likelihood Estimation (MLE)
Let us say we have an independent and identically distributed (iid) sample
$X = \{x^t\}_{t=1}^N$

We assume that $x^t$ are instances drawn from some known probability density family, $p(x|\theta)$, defined up to parameters, $\theta$:

$x^t \sim p(x|\theta)$

We want to find $\theta$ that makes sampling $x^t$ from $p(x|\theta)$ as likely as possible. Because $x^t$ are independent, the likelihood of parameter $\theta$ given sample $X$ is the product of the likelihoods of the individual points:

$l(\theta|X)=p(X|\theta)=\prod_{t=1}^N p(x^t| \theta)$

In maximum likelihood estimation, we are interested in finding the $\theta$ that makes $X$ the most likely to be drawn. We thus search for the $\theta$ that maximizes the likelihood, which we denote by $l(\theta|X)$. We can maximize the log of the likelihood without changing the value where it takes its maximum. $\log(\cdot)$ converts the product into a sum and leads to further computational simplification when certain densities are assumed, for example, those containing exponents. The log likelihood is defined as

$L(\theta|X) = \log l(\theta|X) = \sum_{t=1}^N \log p(x^t|\theta)$

Let us now see some distributions that arise in the applications we are interested in. If we have a two-class problem, the distribution we use is Bernoulli. When there are $K > 2$ classes, its generalization is the multinomial. The Gaussian (normal) density is the one most frequently used for modeling class-conditional input densities with numeric inputs. For the Bernoulli distribution, we discuss the maximum likelihood estimator (MLE) of its parameter.

Bernoulli Density
In a Bernoulli distribution, there are two outcomes: an event occurs or it does not; for example, an instance is a positive example of the class, or it is not. The event occurs and the Bernoulli random variable $X$ takes the value 1 with probability $p$; the nonoccurrence of the event has probability $1 - p$, denoted by $X$ taking the value 0. This is written as

$P(x) = p^x(1 − p)^{1−x}, x \in \{0, 1\}$

The expected value and variance can be calculated as

$E[X]=\sum_x x\,p(x)=1\cdot p+0\cdot(1-p)=p$
$Var[X]=\sum_x (x - E[X])^2 p(x)=p(1-p)$
$p$ is the only parameter and given an iid sample $X = \{x^t\}_{t=1}^N$, where $x^t \in \{0, 1\}$, we want to calculate its estimator, $\hat{p}$. The log likelihood is

$L(p|X)=log \prod_{t=1}^N p^{(x^t)}(1-p)^{(1-x^t)}$
$=\sum_t x^t log(p) + \left( N- \sum_t x^t\right) log(1-p)$
The $\hat{p}$ that maximizes the log likelihood can be found by solving $dL/dp = 0$. The hat (circumflex) denotes that it is an estimate.

$\hat{p}=\frac{\sum_t x^t}{N}$
The estimate for $p$ is the ratio of the number of occurrences of the event to the number of experiments. Recalling that if $X$ is Bernoulli with parameter $p$, then $E[X] = p$; as expected, the maximum likelihood estimator of the mean is the sample average.

Note that the estimate is a function of the sample and is itself a random variable; we can talk about the distribution of $\hat{p}_i$ given different samples $X_i$ drawn from the same $p(x)$. For example, the variance of the distribution of $\hat{p}_i$ is expected to decrease as $N$ increases; as the samples get bigger, they (and hence their averages) get more similar.

Example

Suppose we have the following set of observations from a Bernoulli process:

$x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]$

We want to estimate the parameter $p$, the probability of success.

Steps

  1. Likelihood Function: For a Bernoulli distribution, the likelihood function given our data is:

    $L(p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}$

    where $n$ is the number of observations.

  2. Log-Likelihood Function: Taking the logarithm of the likelihood function, we get:

    $\ell(p) = \log L(p) = \sum_{i=1}^{n} \left( x_i \log p + (1 - x_i) \log (1 - p) \right)$
  3. Maximize the Log-Likelihood: To find the MLE for $p$, we take the derivative of $\ell(p)$ with respect to $p$ and set it to zero:

    $\frac{\partial \ell(p)}{\partial p} = \sum_{i=1}^{n} \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) = 0$
  4. Solve for $p$:

    $\sum_{i=1}^{n} \frac{x_i}{p} = \sum_{i=1}^{n} \frac{1 - x_i}{1 - p}$
    $(1 - p) \sum_{i=1}^{n} x_i = p \sum_{i=1}^{n} (1 - x_i)$

    Let $k = \sum_{i=1}^{n} x_i$ be the total number of successes:

    $(1 - p)k = p(n - k)$
    $k - kp = np - kp$
    $k = np$
    $\hat{p} = \frac{k}{n}$
  5. Calculate the Estimate:

    • Number of observations ($n$) = 10
    • Number of successes ($k$) = 6 (since $x$ contains six 1's)

Therefore, the MLE for $p$ is:

$\hat{p} = \frac{k}{n} = \frac{6}{10} = 0.6$

So, the MLE estimate for the probability of success $p$ in the Bernoulli distribution, given our data, is $\hat{p} = 0.6$.
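The closed-form result can be checked numerically. Below is a minimal Python sketch (not part of the original example) that computes $\hat{p} = k/n$ for this data and confirms it against a grid search over the log-likelihood:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])

def log_likelihood(p, x):
    """Bernoulli log-likelihood: sum_i x_i log p + (1 - x_i) log(1 - p)."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

p_hat = x.mean()                     # closed-form MLE: k / n
grid = np.linspace(0.01, 0.99, 99)   # numerical check over a grid of p values
p_grid = grid[np.argmax([log_likelihood(p, x) for p in grid])]

print(p_hat)   # 0.6
print(p_grid)  # 0.6, the grid maximiser agrees with the closed form
```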

Example: Probability that Liverpool FC wins a match in the next season

In the 2018-19 season, Liverpool FC won 30 out of 38 matches in the Premier League. Having this data, we'd like to make a guess at the probability that Liverpool FC wins a match in the next season.

The simplest guess here would be $30/38 \approx 79\%$, the best possible guess based on the data alone. This is actually an estimate made with the MLE method.

Then, assume we know that Liverpool's winning percentages for the past few seasons were around 50%. Do you think our best guess is still 79%? Some value between 50% and 79% seems more realistic, considering the prior knowledge as well as the data from this season. This is an estimate made with the MAP method.

In this example, we simplify by assuming that Liverpool has a single winning probability (call it $\theta$) throughout all matches across seasons, regardless of the uniqueness of each match and the complex factors of real football matches. In other words, we treat each of Liverpool's matches as a Bernoulli trial with winning probability $\theta$.

With this assumption, we can describe the probability that Liverpool wins $k$ times out of $n$ matches for any given numbers $k$ and $n$ ($k \le n$). More precisely, we assume that the number of wins of Liverpool follows a binomial distribution with parameter $\theta$. The probability that Liverpool wins $k$ times out of $n$ matches, given the winning probability $\theta$, is:

$P(D|\theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k}$

This simplification (describing the probability using just a single parameter $\theta$ regardless of real-world complexity) is the statistical modelling of this example, and $\theta$ is the parameter to be estimated.

Maximum Likelihood Estimation

In the previous section, we got the formula of probability that Liverpool wins $k$ times out of $n$ matches for given $\theta$.
Since we have the observed data from this season, which is 30 wins out of 38 matches (let’s call this data as $D$), we can calculate $P(D|θ)$ — the probability that this data $D$ is observed for given $\theta$. Let’s calculate $P(D|\theta)$ for $\theta=0.1$ and $\theta=0.7$ as examples.

When Liverpool's winning probability is $\theta = 0.1$, the probability that this data $D$ (30 wins in 38 matches) is observed is:

$P(D|\theta=0.1) = \binom{38}{30}(0.1)^{30}(0.9)^{8} \approx 2.11 \times 10^{-23}$

So, if Liverpool's winning probability $\theta$ is actually 0.1, this data $D$ (30 wins in 38 matches) is extremely unlikely to be observed. Then what if $\theta = 0.7$?


$P(D|\theta=0.7) = \binom{38}{30}(0.7)^{30}(0.3)^{8} \approx 0.072$, much higher than the previous one. So if Liverpool's winning probability $\theta$ is 0.7, this data $D$ is much more likely to be observed than when $\theta = 0.1$.

Based on this comparison, we would be able to say that $\theta$ is more likely to be 0.7 than 0.1 considering the actual observed data $D$.
Here, we've been calculating the probability that $D$ is observed for each $\theta$, but at the same time, we can also say that we've been checking the likelihood of each value of $\theta$ based on the observed data. Because of this, $P(D|\theta)$ is also considered the likelihood of $\theta$. The next question is: what is the exact value of $\theta$ that maximises the likelihood $P(D|\theta)$? Yes, this is Maximum Likelihood Estimation!

The value of $\theta$ maximising the likelihood can be obtained by taking the derivative of the likelihood function with respect to $\theta$ and setting it to zero:

$\frac{d}{d\theta}\binom{n}{k}\theta^k(1-\theta)^{n-k} = \binom{n}{k}\theta^{k-1}(1-\theta)^{n-k-1}(k - n\theta) = 0$

By solving this, $\theta = 0$, $1$, or $k/n$. Since the likelihood goes to zero when $\theta = 0$ or $1$, the value of $\theta$ maximising the likelihood is $k/n$.

In this example, the estimated value of $\theta$ is $30/38 \approx 78.9\%$ when estimated with MLE.
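The likelihood comparison for different values of $\theta$ can be reproduced with a few lines of Python. This is an illustrative sketch of the binomial likelihood for 30 wins in 38 matches, not code from the original text:

```python
from math import comb

n, k = 38, 30  # 38 matches, 30 wins

def likelihood(theta):
    """Binomial likelihood P(D | theta) of observing k wins in n matches."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(likelihood(0.1))      # ~2.1e-23: the data is essentially impossible if theta = 0.1
print(likelihood(0.7))      # ~0.072: far more plausible
print(likelihood(30 / 38))  # the MLE theta = k/n gives the largest likelihood of all
```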

Maximum Likelihood Estimation of Normal Distribution

To derive the Maximum Likelihood Estimators (MLEs) for the mean ($\mu$) and variance ($\sigma^2$) of a normal distribution, we need to maximize the log-likelihood function. Here's a step-by-step process:

Given Data and Model

Let's assume we have a set of $n$ independent and identically distributed observations $x_1, x_2, \ldots, x_n$ from a normal distribution $\mathcal{N}(\mu, \sigma^2)$.

The probability density function (pdf) of a normal distribution is:

$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Likelihood Function

The likelihood function for the given data is:

$L(\mu, \sigma^2) = \prod_{i=1}^{n} f(x_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$

Log-Likelihood Function

Taking the logarithm of the likelihood function, we get the log-likelihood function:

$\ell(\mu, \sigma^2) = \log L(\mu, \sigma^2) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right)$
$\ell(\mu, \sigma^2) = \sum_{i=1}^{n} \left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2}\right)$
$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$

Derivative with respect to $\mu$

To find the MLE for $\mu$, we take the derivative of the log-likelihood function with respect to $\mu$ and set it to zero:

$\frac{\partial \ell(\mu, \sigma^2)}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{\partial}{\partial \mu}(x_i-\mu)^2$
$\frac{\partial \ell(\mu, \sigma^2)}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i-\mu)(-1)$
$\frac{\partial \ell(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu)$

Setting the derivative to zero, we get:

$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0$
$\sum_{i=1}^{n}(x_i-\mu) = 0$
$\sum_{i=1}^{n} x_i - n\mu = 0$
$n\mu = \sum_{i=1}^{n} x_i$
$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

So, the MLE for $\mu$ is:

$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Derivative with respect to $\sigma^2$

To find the MLE for $\sigma^2$, we take the derivative of the log-likelihood function with respect to $\sigma^2$ and set it to zero:

$\frac{\partial \ell(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2$

Setting the derivative to zero, we get:

$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$
$-\frac{n}{\sigma^2} + \frac{1}{\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0$
$-\frac{n}{\sigma^2} + \frac{S}{\sigma^4} = 0$

where $S = \sum_{i=1}^{n}(x_i-\mu)^2$.

Multiplying through by $\sigma^4$, we get:

$-n\sigma^2 + S = 0$
$S = n\sigma^2$
$\sigma^2 = \frac{S}{n} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$

So, the MLE for $\sigma^2$ is:

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$

In summary:
  • The MLE for the mean $\mu$ is:

    $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  • The MLE for the variance $\sigma^2$ is:

    $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$
Example

Let's consider a simple example where we assume our data follows a normal (Gaussian) distribution. We want to estimate the mean $\mu$ and variance $\sigma^2$ of this distribution.

Step-by-Step Example

  1. Assume the Data: Suppose we have the following set of observations:

    $x=[2.3,1.9,3.1,2.8,3.0]$
  2. Model Assumption: We assume the data follows a normal distribution with unknown mean $\mu$ and variance $\sigma^2$.

  3. Likelihood Function: The probability density function of the normal distribution is:

    $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

    Therefore, the likelihood function for our data is:

    $L(\mu, \sigma^2) = \prod_{i=1}^{5} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$
  4. Log-Likelihood Function: Taking the logarithm of the likelihood function, we get:

    $\ell(\mu, \sigma^2) = \sum_{i=1}^{5} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)\right)$

    Simplifying, this becomes:

    $\ell(\mu, \sigma^2) = -\frac{5}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{5}(x_i-\mu)^2$
  5. Maximize the Log-Likelihood:

    • To find the estimates for $\mu$ and $\sigma^2$, we take partial derivatives of $\ell(\mu, \sigma^2)$ with respect to $\mu$ and $\sigma^2$, and set them to zero.
    • Solving these equations gives:
      $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{5} x_i$
      $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{5}(x_i-\hat{\mu})^2$
  6. Calculate Estimates:

    • Mean estimate:
      $\hat{\mu} = \frac{2.3 + 1.9 + 3.1 + 2.8 + 3.0}{5} = 2.62$
    • Variance estimate:
      $\hat{\sigma}^2 = \frac{(2.3-2.62)^2 + (1.9-2.62)^2 + (3.1-2.62)^2 + (2.8-2.62)^2 + (3.0-2.62)^2}{5} = 0.2056$

So, the MLE estimates for the mean and variance of the normal distribution given our data are $\hat{\mu} = 2.62$ and $\hat{\sigma}^2 = 0.2056$.
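These two estimates are easy to verify in code. A minimal sketch; note that NumPy's `np.var` divides by $n$ by default (`ddof=0`), which is exactly the MLE of the variance:

```python
import numpy as np

x = np.array([2.3, 1.9, 3.1, 2.8, 3.0])

mu_hat = x.mean()                      # MLE of the mean: the sample average
sigma2_hat = np.mean((x - mu_hat)**2)  # MLE of the variance: divide by n, not n - 1

print(mu_hat)      # 2.62
print(sigma2_hat)  # 0.2056, equal to np.var(x) since its default ddof=0 is the MLE
```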

Example (university question)
Suppose the weights of randomly selected female students at a school are normally distributed with unknown mean and standard deviation. A random sample of 10 female students yielded the following weights (in pounds): 115, 122, 130, 127, 149, 160, 152, 138, 149, 180. Find the maximum likelihood estimates of the mean weight and variance.

Calculate the Mean ($\mu$):

$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Here, $n = 10$ (the number of data points), and the data points $x_i$ are given.

$\hat{\mu} = \frac{1}{10}(115 + 122 + 130 + 127 + 149 + 160 + 152 + 138 + 149 + 180)$

First, sum the data points:

$115 + 122 + 130 + 127 + 149 + 160 + 152 + 138 + 149 + 180 = 1422$


Now, calculate the mean:

$\hat{\mu}=\frac{1422}{10}=142.2$

Calculate the Variance ($\sigma^2$):

$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2$

Now, calculate the variance:

σ^2=3479.610=347.96\hat{\sigma}^2 = \frac{3479.6}{10} = 347.96
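As a quick check of this arithmetic, the same estimates can be computed with NumPy (a sketch, not part of the original question):

```python
import numpy as np

weights = np.array([115, 122, 130, 127, 149, 160, 152, 138, 149, 180], dtype=float)

mu_hat = weights.mean()       # 1422 / 10 = 142.2
sigma2_hat = np.var(weights)  # default ddof=0 divides by n, i.e. the MLE

print(mu_hat)      # 142.2
print(sigma2_hat)  # 347.96
```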

Maximum A Posteriori Estimation (MAP)
Maximum A Posteriori (MAP) estimation is a method in Bayesian statistics used to estimate the parameters of a statistical model. It combines the likelihood of the observed data with prior information about the parameters to find the most probable parameter value.

MLE is powerful when you have enough data. However, it doesn't work well when the observed data size is small. For example, if Liverpool had played only 2 matches and won both, then the value of $\theta$ estimated by MLE would be $2/2 = 1$: the estimation says Liverpool wins 100% of matches, which is unrealistic. MAP can help deal with this issue.

Assume that we have prior knowledge that Liverpool's winning percentage for the past few seasons was around 50%. Then, even without the data from this season, we already have some idea of the potential value of $\theta$. Based (only) on the prior knowledge, $\theta$ is most likely to be 0.5, and less likely to be 0 or 1. In other words, the probability of $\theta = 0.5$ is higher than that of $\theta = 0$ or $1$. We call this the prior probability $P(\theta)$.

Then, having observed the data $D$ (30 wins out of 38 matches) from this season, we can update $P(\theta)$, which is based only on the prior knowledge. The updated probability of $\theta$ given $D$ is expressed as $P(\theta|D)$ and is called the posterior probability.

Now, we want to know the best guess of $\theta$ considering both our prior knowledge and the observed data. This means maximising $P(\theta|D)$, and that is the MAP estimation.
The question here is how to calculate $P(\theta|D)$. We have seen how to calculate $P(D|\theta)$, but not yet $P(\theta|D)$. To do so, we need to use Bayes' theorem:

$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$

With this theorem, we can calculate the posterior probability $P(\theta|D)$ using the likelihood $P(D|\theta)$ and the prior probability $P(\theta)$.

There's $P(D)$ in the equation, but $P(D)$ is independent of the value of $\theta$. Since we're only interested in finding the $\theta$ maximising $P(\theta|D)$, we can ignore $P(D)$ in our maximisation:

$\underset{\theta}{\operatorname{argmax}}\, P(\theta|D) = \underset{\theta}{\operatorname{argmax}}\, P(D|\theta)\,P(\theta)$

This means that the maximisation of the posterior probability $P(\theta|D)$ with respect to $\theta$ is equal to the maximisation of the product of the likelihood $P(D|\theta)$ and the prior probability $P(\theta)$ with respect to $\theta$.


In principle, we can use any probability distribution as $P(\theta)$ to express our prior knowledge. However, for computational simplicity, a specific probability distribution is used that corresponds to the probability distribution of the likelihood. It is called the conjugate prior distribution.

In this example, the likelihood $P(D|\theta)$ follows a binomial distribution. Since the conjugate prior of the binomial distribution is the Beta distribution, we use the Beta distribution to express $P(\theta)$:

$P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}$

where $B(\alpha, \beta)$ is the Beta function, a normalising constant. Here, $\alpha$ and $\beta$ are called hyperparameters, which cannot be determined from the data; rather, we set them subjectively to express our prior knowledge well. For example, the graphs below are visualisations of the Beta distribution with different values of $\alpha$ and $\beta$. The top left graph is the one used in the example above (expressing that $\theta = 0.5$ is the most likely value based on the prior knowledge), and the top right graph expresses the same prior knowledge, but for a believer that past seasons' results reflect Liverpool's true capability very well.
Figure 2. Visualizations of Beta distribution with different values of α and β

A note about the bottom right graph: when $\alpha = 1$ and $\beta = 1$, the prior is uniform, meaning we don't have any prior knowledge about $\theta$; in this case the estimation is exactly the same as the one by MLE. So, by now we have all the components needed to calculate the $P(D|\theta)P(\theta)$ to maximise.

As with MLE, we can get the $\theta$ maximising this by taking the derivative of this function with respect to $\theta$ and setting it to zero. Since

$P(D|\theta)P(\theta) \propto \theta^{k+\alpha-1}(1-\theta)^{n-k+\beta-1}$

setting the derivative to zero and solving, we obtain the following:

$\theta = \frac{k+\alpha-1}{n+\alpha+\beta-2}$

In this example, assuming we use $\alpha = 10$ and $\beta = 10$, then $\theta = (30+10-1)/(38+10+10-2) = 39/56 \approx 69.6\%$.
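The MAP result is easy to sanity-check in code. A minimal sketch using the Beta-prior mode formula, with $\alpha = \beta = 10$ as in the text, and a uniform prior ($\alpha = \beta = 1$) to confirm that MAP then reduces to the MLE:

```python
def map_estimate(k, n, alpha, beta):
    """Mode of the Beta(alpha + k, beta + n - k) posterior:
    (k + alpha - 1) / (n + alpha + beta - 2)."""
    return (k + alpha - 1) / (n + alpha + beta - 2)

print(map_estimate(30, 38, 10, 10))  # 39/56 ≈ 0.696, pulled toward the 0.5 prior
print(map_estimate(30, 38, 1, 1))    # uniform prior: 30/38 ≈ 0.789, identical to MLE
```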

Example:

Let's consider an example using a Bernoulli distribution with a known prior.

Given Data and Model

Suppose we have a set of observations from a Bernoulli process: $x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]$

We want to estimate the parameter $p$, the probability of success. Assume we have a Beta prior for $p$:

$p \sim \text{Beta}(\alpha, \beta)$

where $\alpha$ and $\beta$ are the parameters of the Beta distribution.

Step-by-Step Calculation

  1. Likelihood: The likelihood of the data given $p$ is:

    $P(x|p) = p^k (1 - p)^{n - k}$

    where $k$ is the number of successes (the sum of the data) and $n$ is the total number of observations.

  2. Prior: The prior distribution $P(p)$ is given by the Beta distribution:

    $P(p) \propto p^{\alpha - 1} (1 - p)^{\beta - 1}$
  3. Posterior: Combining the likelihood and the prior using Bayes' theorem:

    $P(p|x) \propto P(x|p)\,P(p)$
    $P(p|x) \propto p^k (1 - p)^{n - k} \cdot p^{\alpha - 1} (1 - p)^{\beta - 1}$
    $P(p|x) \propto p^{k + \alpha - 1} (1 - p)^{n - k + \beta - 1}$

    The posterior distribution is also a Beta distribution with updated parameters $\alpha + k$ and $\beta + n - k$:

    $p|x \sim \text{Beta}(\alpha + k, \beta + n - k)$
  4. MAP Estimate: The mode of the Beta distribution (which is the MAP estimate) is given by:

    $\hat{p}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}$

    This formula is valid for $\alpha, \beta > 1$.

Example with Specific Values

Let's assume $\alpha = 2$ and $\beta = 2$ for the prior. Given the data $x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]$:

  • $n = 10$ (total number of observations)
  • $k = 6$ (number of successes)

The posterior distribution parameters are:

$\alpha' = \alpha + k = 2 + 6 = 8$
$\beta' = \beta + n - k = 2 + 10 - 6 = 6$

So the MAP estimate for $p$ is:

$\hat{p}_{\text{MAP}} = \frac{8 - 1}{8 + 6 - 2} = \frac{7}{12} \approx 0.583$
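The closed-form mode can be cross-checked by maximising the unnormalised log-posterior numerically. A sketch under the same numbers ($\alpha = \beta = 2$, $n = 10$, $k = 6$):

```python
import numpy as np

alpha, beta, n, k = 2, 2, 10, 6

# closed-form MAP estimate: mode of the Beta(alpha + k, beta + n - k) posterior
p_map = (alpha + k - 1) / (alpha + beta + n - 2)

# numerical check: maximise the unnormalised log-posterior over a grid of p values
grid = np.linspace(0.001, 0.999, 999)
log_post = (k + alpha - 1) * np.log(grid) + (n - k + beta - 1) * np.log(1 - grid)
p_grid = grid[np.argmax(log_post)]

print(p_map)   # 7/12 ≈ 0.583
print(p_grid)  # ≈ 0.583, the grid maximiser agrees with the closed form
```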


********************************************************************************
More Examples
Example:
To illustrate Bayes' rule, consider a medical diagnosis problem in which there are two alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that the patient does not. The available data is from a particular laboratory test with two possible outcomes: + (positive) and - (negative). We have prior knowledge that over the entire population of people only 0.008 have this disease.
Furthermore, the lab test is only an imperfect indicator of the disease. The test returns a correct positive result in only 98% of the cases in which the disease is actually present and a correct negative result in only 97% of the cases in which the disease is not present. In other cases, the test returns the opposite result. The above situation can be summarized by the following probabilities:

$P(cancer)=0.008$  $P(\neg cancer)=0.992$
$P(+|cancer)=0.98$  $P(-|cancer)=0.02$
$P(+|\neg cancer)=0.03$  $P(-|\neg cancer)=0.97$

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori hypothesis can be found using Equation

$P(h|D) \propto P(D|h) \times P(h)$
$P(cancer|+) \propto P(+|cancer) \times P(cancer) = 0.98 \times 0.008 = 0.0078$
$P(\neg cancer|+) \propto P(+|\neg cancer) \times P(\neg cancer) = 0.03 \times 0.992 = 0.0298$

Thus, $h_{MAP}= \neg cancer$
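The comparison can be written out directly in code; normalising the two products also gives the actual posterior probability of cancer, which the worked numbers leave implicit. A sketch:

```python
# prior and test characteristics from the example
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

# unnormalised posteriors P(+|h) * P(h) for each hypothesis
score_cancer = p_pos_given_cancer * p_cancer            # 0.98 * 0.008 = 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer   # 0.03 * 0.992 = 0.02976

print("MAP hypothesis:", "cancer" if score_cancer > score_no_cancer else "not cancer")

# normalising gives the posterior probability of cancer given a positive test
print(score_cancer / (score_cancer + score_no_cancer))  # ≈ 0.21
```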

Example:
Suppose that $X$ is a discrete random variable with the following probability mass function, where $0 \le \theta \le 1$ is a parameter:

$X:\quad\quad\; 0 \quad\quad 1 \quad\quad 2 \quad\quad\quad 3$
$P(X):\quad \frac{2\theta}{3} \quad\; \frac{\theta}{3} \quad \frac{2(1-\theta)}{3} \quad \frac{1-\theta}{3}$

The following 10 independent observations were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of $\theta$?

Since the sample is (3,0,2,1,3,2,1,0,2,1), the likelihood is
$L(\theta) = P(X = 3)P(X = 0)P(X = 2)P(X = 1)P(X = 3)P(X = 2)P(X = 1)P(X = 0)P(X = 2)P(X = 1)$
Substituting from the probability distribution given above, we have
$L(\theta)=\prod P(X_i|\theta)=\left(\frac{2\theta}{3}\right)^2\left(\frac{\theta}{3}\right)^3\left(\frac{2(1-\theta)}{3}\right)^3\left(\frac{(1-\theta)}{3}\right)^2$

Clearly, the likelihood function $L(\theta)$ is not easy to maximize.
Let us look at the log likelihood function

$l(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(X_i|\theta)$
$\quad\quad = 2\left(\log\frac{2}{3} + \log\theta\right) + 3\left(\log\frac{1}{3} + \log\theta\right) + 3\left(\log\frac{2}{3} + \log(1-\theta)\right) + 2\left(\log\frac{1}{3} + \log(1-\theta)\right)$
$\quad\quad = C + 5\log\theta + 5\log(1-\theta)$
where C is a constant which does not depend on $\theta$. It can be seen that the log likelihood
function is easier to maximize compared to the likelihood function.
Let the derivative of $l(\theta)$ with respect to $\theta$ be zero:

$\frac{\mathrm{d} l(\theta)}{\mathrm{d} \theta}=\frac{5}{\theta}-\frac{5}{1-\theta}=0$
and the solution gives us the MLE, which is $\theta = 0.5$.
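Since the pmf is discrete, the MLE can also be confirmed by evaluating the log likelihood over a grid of $\theta$ values. A minimal sketch of that check, using the pmf table from the question:

```python
import numpy as np

sample = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

def log_likelihood(theta):
    """Sum of log pmf values over the sample, using the P(X) table above."""
    pmf = {0: 2 * theta / 3, 1: theta / 3, 2: 2 * (1 - theta) / 3, 3: (1 - theta) / 3}
    return sum(np.log(pmf[x]) for x in sample)

grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_hat)  # 0.5, matching the closed-form solution
```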

Example.
A coin is flipped 100 times. Given that there are 55 heads, find the maximum likelihood estimate for the probability $p$ of heads on a single toss.

We can think of counting the number of heads in 100 tosses as an experiment. For a given value of $p$, the probability of getting 55 heads in this experiment is the binomial probability
$P(55 \text{ heads}) = \binom{100}{55} p^{55} (1-p)^{45}$

The probability of getting 55 heads depends on the value of $p$, so let's include this as a conditional probability:
$P(55 \text{ heads} \mid p) = \binom{100}{55} p^{55} (1-p)^{45}$

Experiment: Flip the coin 100 times and count the number of heads.
Data: The data is the result of the experiment. In this case it is '55 heads'.
Parameter(s) of interest: We are interested in the value of the unknown parameter $p$.
Likelihood, or likelihood function: this is $P(\text{data}|p)$. Note that it is a function of both the data and the parameter $p$.


In this case the likelihood is
$P(55 \text{ heads}|p)=\binom{100}{55}p^{55}(1-p)^{45}$
Note: the likelihood $P(\text{data}|p)$ changes as the parameter of interest $p$ changes.

Definition: Given data, the maximum likelihood estimate (MLE) for the parameter $p$ is the value of $p$ that maximizes the likelihood $P(\text{data}|p)$. That is, the MLE is the value of $p$ for which the data is most likely.

In order to find the maximum value we take the derivative of the likelihood function, set it to zero, and solve for $p$.
$\frac{\mathrm{d} }{\mathrm{d} p} P(\text{data}|p)=\binom{100}{55}\left(55p^{54}(1-p)^{45}-45p^{55}(1-p)^{44}\right)=0$
Solving this for $p$, we get
$55p^{54}(1-p)^{45}=45p^{55}(1-p)^{44}$
Dividing both sides by $p^{54}(1-p)^{44}$:
$55(1-p)=45p$
$55=100p$
$p=55/100=0.55$

So the MLE is $\hat{p} = 0.55$.
Notes:
1. The MLE for $p$ turned out to be exactly the fraction of heads we saw in our data.
2. The MLE is computed from the data. That is, it is a statistic.
3. Strictly speaking, you should check that the critical point is indeed a maximum. You can do this with the second derivative test.
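The same answer can be recovered numerically. Here is a minimal sketch that maximizes the binomial log likelihood with SciPy's bounded scalar minimizer (minimizing the negative log likelihood, as optimizers conventionally minimize):

```python
from scipy.stats import binom
from scipy.optimize import minimize_scalar

# Negative binomial log likelihood for 55 heads in 100 tosses
def neg_log_likelihood(p):
    return -binom.logpmf(55, 100, p)

# Bounded one-dimensional minimization over 0 < p < 1
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method='bounded')
p_hat = res.x
print(p_hat)  # close to 55/100 = 0.55
```

The optimizer converges to the analytic MLE $\hat{p} = 0.55$.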

Example. Suppose that the lifetime of bulbs is modeled by an exponential distribution with (unknown) parameter $\lambda$. We test 5 bulbs and find they have lifetimes of 2, 3, 1, 3, and 4 years, respectively. What is the MLE for $\lambda$?


The Exponential Distribution: A continuous random variable $X$ is said to have an Exponential($\lambda$) distribution if it has probability density function $f_X(x|\lambda)=\lambda e^{-\lambda x}$ for $x>0$ and $0$ for $x\le 0$, where $\lambda>0$ is called the rate of the distribution.

Let $X_i$ be the lifetime of the $i^{th}$ bulb and let $x_i$ be the value $X_i$ takes; then each $X_i$ has pdf $f_{X_i}(x_i)=\lambda e^{- \lambda x_i}$. We assume that the lifetimes of the bulbs are independent, so the joint pdf is the product of the individual densities:

$f(x_1,x_2,x_3,x_4,x_5|\lambda)=(\lambda e^{-\lambda x_1})(\lambda e^{-\lambda x_2})(\lambda e^{-\lambda x_3})(\lambda e^{-\lambda x_4})(\lambda e^{-\lambda x_5})$

$f(x_1,x_2,x_3,x_4,x_5|\lambda)=\lambda^5 e^{-\lambda(x_1+x_2+x_3+x_4+x_5)}$
$f(2,3,1,3,4|\lambda)=\lambda^5 e^{-\lambda(2+3+1+3+4)}$
$f(2,3,1,3,4|\lambda)=\lambda^5 e^{-13\lambda}$
$\ln f(2,3,1,3,4|\lambda)=\ln(\lambda^5)+\ln(e^{-13\lambda})=5\ln(\lambda)-13\lambda$
$\frac{\mathrm{d} }{\mathrm{d} \lambda}\ln f(2,3,1,3,4|\lambda)=\frac{5}{\lambda}-13=0$
$\hat{\lambda}=\frac{5}{13}$
Note that the MLE $\hat{\lambda}=\frac{5}{13}$ is the reciprocal of the sample mean $\bar{x}=\frac{13}{5}$.
In general, the MLE of $\lambda$ is
$\hat{\lambda}=\frac{n}{x_1+x_2+\cdots+x_n}=\frac{1}{\bar{x}}$
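This can be verified on the bulb data. The sketch below computes the closed-form MLE and cross-checks it against a grid search over the log likelihood $l(\lambda)=n\ln\lambda-\lambda\sum x_i$ (dropping terms that do not depend on $\lambda$):

```python
import numpy as np

# Bulb lifetimes from the example
lifetimes = np.array([2.0, 3.0, 1.0, 3.0, 4.0])

# Closed-form MLE: lambda_hat = n / sum(x_i), the reciprocal of the sample mean
lambda_hat = len(lifetimes) / lifetimes.sum()  # 5/13

# Numerical check: l(lam) = n*log(lam) - lam*sum(x_i), up to a constant
grid = np.linspace(0.01, 2.0, 2000)
loglik = len(lifetimes) * np.log(grid) - grid * lifetimes.sum()
lambda_grid = grid[np.argmax(loglik)]
print(lambda_hat)  # 5/13, about 0.3846
```

The grid maximizer agrees with $\hat{\lambda}=5/13$ to the resolution of the grid.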

Example
Normal distribution
Suppose the data $x_1,x_2,\ldots,x_n$ is drawn from a $N(\mu,\sigma)$ distribution, where $\mu$ and $\sigma$ are unknown. Find the maximum likelihood estimate for the pair $(\mu,\sigma)$.

Let uppercase $X_1,X_2,\ldots,X_n$ be i.i.d. $N(\mu,\sigma)$ random variables, and let lowercase $x_i$ be the value $X_i$ takes. The density for each $X_i$ is

$f_{X_i}(x_i)=\frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x_i-\mu)^2}{2\sigma^2}}$

Since the $X_i$ are independent their joint pdf is the product of the individual pdf's:
$f(x_1,x_2,\ldots,x_n|\mu,\sigma)=\left(\frac{1}{\sqrt{2\pi}\sigma}\right )^ne^{-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}}$
The log likelihood is
$ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))=-nln(\sqrt{2\pi})-nln(\sigma)-\sum_{i=1}^n\frac{(x_i-\mu)^2}{2\sigma^2}$

Since $\ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))$ is a function of the two variables $\mu$ and $\sigma$, we can use partial derivatives to find the MLE.

$\frac{\partial }{\partial\mu}ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))=\sum_{i=1}^n\frac{(x_i-\mu)}{\sigma^2}=0$
$\sum_{i=1}^n x_i=n\mu$
$\hat\mu=\frac{\sum_{i=1}^n x_i}{n}=\bar{x}$

To find $\hat{\sigma}$ we can differentiate and solve for $\sigma$

$\frac{\partial }{\partial\sigma}ln(f(x_1,x_2,\ldots,x_n|\mu,\sigma))=-\frac{n}{\sigma}+\sum_{i=1}^n\frac{(x_i-\mu)^2}{\sigma^3}=0$

$\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2$

We already know that $\hat{\mu}=\bar{x}$, so we can substitute that value.
So the MLEs are

$\hat{\mu}=\bar{x}= $ mean of the data
$\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 = $ variance of the data
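These closed-form estimates are one line each in NumPy. A small sketch (the sample values are hypothetical, chosen only for illustration); note that `np.var` uses `ddof=0` by default, which is exactly the biased MLE variance derived above:

```python
import numpy as np

# Hypothetical sample (illustrative values, not from the text)
x = np.array([2.1, 1.7, 2.5, 1.9, 2.3, 2.0])

mu_hat = x.mean()                                    # MLE of mu: the sample mean
sigma2_hat = np.var(x)                               # ddof=0 by default, i.e. the MLE
sigma2_manual = np.sum((x - mu_hat) ** 2) / len(x)   # the formula derived above
```

Dividing by $n$ rather than $n-1$ is what makes this the MLE; `np.var(x, ddof=1)` would give the unbiased sample variance instead.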

Example
MLE of the Poisson distribution

The Poisson distribution is a probability distribution used to show how many times an event is likely to occur over a specified period. In other words, it is a count distribution. Poisson distributions are often used to model independent events that occur at a constant rate within a given interval of time. It was named after the French mathematician Siméon Denis Poisson.


The Poisson distribution is a discrete distribution, meaning that the variable can only take specific values in a (potentially infinite) list. Put differently, the variable cannot take all values in a continuous range.

The probability mass function is
$f(x)=\frac{\lambda^x}{x!}e^{-\lambda}$
where $x$ is the number of occurrences and $\lambda$ is the expected value of $x$, which is also equal to its variance.

The likelihood function is the product of the PMF of the observed values $x_1,x_2,\ldots,x_n$
$L(x_1,x_2,\ldots,x_n|\lambda)=\prod_{j=1}^n\frac{\lambda^{x_j}e^{-\lambda}}{x_j!}$

The log likelihood function is
$\ln(L(x_1,x_2,\ldots,x_n|\lambda))=\ln\left(\prod_{j=1}^n\frac{\lambda^{x_j}e^{-\lambda}}{x_j!}\right)$
$=\sum_{j=1}^n\left[\ln(\lambda^{x_j})+\ln(e^{-\lambda})-\ln(x_j!)\right]$
$=\sum_{j=1}^n\left[x_j \ln(\lambda)-\lambda-\ln(x_j!)\right]$
$=-n\lambda+\ln(\lambda)\sum_{j=1}^n x_j-\sum_{j=1}^n \ln(x_j!)$

Next, we calculate the derivative of the log likelihood function with respect to the parameter $\lambda$:

$\frac{\mathrm{d} }{\mathrm{d} \lambda}ln(L(x_1,x_2,\ldots,x_n|\lambda))$
$=-n+\frac{1}{\lambda}\sum_{j=1}^n x_j$

Set the derivative equal to zero and solve for $\lambda$
$-n+\frac{1}{\lambda}\sum_{j=1}^n x_j=0$
$\lambda=\frac{1}{n}\sum_{j=1}^n x_j$

Thus the MLE turns out to be
$\hat{\lambda}=\frac{1}{n}\sum_{j=1}^n x_j$

which is the sample mean of the $n$ observations.
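This can be confirmed numerically as well. The sketch below (with hypothetical count data, used only for illustration) minimizes the negative Poisson log likelihood and compares the result to the sample mean:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize_scalar

# Hypothetical event counts (illustrative, not from the text)
counts = np.array([2, 4, 3, 1, 5, 2, 3])

# Closed-form MLE: the sample mean
lam_closed = counts.mean()

# Numerical check: minimize the negative Poisson log likelihood
def neg_log_likelihood(lam):
    return -np.sum(poisson.logpmf(counts, lam))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20.0), method='bounded')
lam_hat = res.x
```

The numerical optimum matches the sample mean, as the derivation predicts.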

Example Python Code
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate some sample data from a normal distribution
np.random.seed(42)
true_mean = 2.0
true_std = 1.5
sample_size = 100
sample_data = np.random.normal(true_mean, true_std, sample_size)

# Define the likelihood function for a normal distribution
# (a product of individual densities; this can underflow for large samples)
def likelihood(data, mean, std):
    return np.prod(norm.pdf(data, loc=mean, scale=std))

# Define the negative log likelihood function (to be minimized).
# Summing log densities is numerically more stable than taking the
# log of the product, and we reject invalid standard deviations that
# Nelder-Mead may try along the way.
def negative_log_likelihood(params, data):
    mean, std = params
    if std <= 0:
        return np.inf
    return -np.sum(norm.logpdf(data, loc=mean, scale=std))

# Initial guess for mean and standard deviation
initial_guess = [0.0, 1.0]

# Use optimization to find the MLE estimates
from scipy.optimize import minimize
result = minimize(negative_log_likelihood, initial_guess, args=(sample_data,), method='Nelder-Mead')

# Extract the MLE estimates from the optimization result
estimated_mean, estimated_std = result.x

# Print the results
print("True Mean:", true_mean)
print("True Standard Deviation:", true_std)
print("Estimated Mean (MLE):", estimated_mean)
print("Estimated Standard Deviation (MLE):", estimated_std)

# Plot the histogram of the sample data and the true/estimated PDFs
plt.hist(sample_data, bins=20, density=True, alpha=0.6, label='Sample Data')
x_range = np.linspace(-5, 10, 400)
plt.plot(x_range, norm.pdf(x_range, loc=true_mean, scale=true_std), 'r', label='True PDF')
plt.plot(x_range, norm.pdf(x_range, loc=estimated_mean, scale=estimated_std), 'g', label='Estimated PDF (MLE)')
plt.legend()
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Maximum Likelihood Estimation Example')
plt.show()

True Mean: 2.0 
True Standard Deviation: 1.5 
Estimated Mean (MLE): 1.8442158278764509 
Estimated Standard Deviation (MLE): 1.35541802281766


