Bayesian Formulation

 

Learn the basics of probability here: https://mmlandpython.blogspot.com/2021/09/40-basics-of-probability.html


We live in a probabilistic world. Everything that happens has uncertainty attached to it. The Bayesian interpretation of probability underlies much of Machine Learning. Under this interpretation, a probability quantifies our uncertainty about an event.

Because of this, we base our probabilities on the information available about an event, rather than only on counting outcomes over repeated trials. For example, when predicting a football match, instead of counting the number of times Manchester United have won against Liverpool, a Bayesian approach would use relevant information such as current form, league position and the starting team.

The benefit of taking this approach is that probabilities can still be assigned to rare events, as the decision making process is based on relevant features and reasoning.

Thomas Bayes introduced ideas that are essential to the probability theory used in Machine Learning. Bayes Theorem provides a principled way of calculating a conditional probability.

Although it is a powerful tool in the field of probability, Bayes Theorem is also widely used in the field of machine learning. Uses include a probabilistic framework for fitting a model to a training dataset, referred to as maximum a posteriori or MAP for short, and the development of models for classification predictive modeling problems, such as the Bayes Optimal Classifier and Naive Bayes.

Bayes Theorem of Conditional Probability

Recall that marginal probability is the probability of an event, irrespective of other random variables. If the random variable is independent of the others, the marginal probability is simply the probability of the event. Otherwise, it is the probability of the event summed over all outcomes of the other variables, which is called the sum rule.

Marginal Probability: The probability of an event irrespective of the outcomes of other random variables, e.g. $P(A)$.

The joint probability is the probability of two (or more) simultaneous events, often described in terms of events $A$ and $B$ from two dependent random variables, e.g. $X$ and $Y$. The joint probability is often summarized as just the outcomes, e.g. $A$ and $B$.

Joint Probability: Probability of two (or more) simultaneous events, e.g. $P(A \text{ and } B)$ or $P(A, B)$ or $P(A \wedge B)$.

The conditional probability is the probability of one event given the occurrence of another event, often described in terms of events $A$ and $B$ from two dependent random variables e.g. $X$ and $Y$.

Conditional Probability: Probability of one (or more) event given the occurrence of another event, e.g. $P(A \text{ given } B)$ or $P(A | B)$.

The joint probability can be calculated using the conditional probability; for example:
$P(A, B) = P(A | B) * P(B)$

This is called the product rule. Importantly, the joint probability is symmetrical, meaning that:
$P(A, B) = P(B, A)$

The conditional probability can be calculated using the joint probability; for example:
$P(A | B) = P(A, B) / P(B)$

The conditional probability is not symmetrical in general; for example:
$P(A | B) \neq P(B | A)$
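
To make the sum rule, the product rule and the asymmetry of conditional probability concrete, here is a minimal sketch in Python. The joint distribution over two binary events is made up purely for illustration, and the helper names (`marginal_a`, `cond_a_given_b`, and so on) are our own.

```python
# Hypothetical joint distribution P(A, B) over two binary events.
# The four numbers are assumptions chosen for illustration; they sum to 1.
joint = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.50,
}


def marginal_a(a):
    """Sum rule: P(A=a) is the joint summed over every outcome of B."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)


def marginal_b(b):
    """Sum rule: P(B=b) is the joint summed over every outcome of A."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)


def cond_a_given_b(a, b):
    """P(A=a | B=b) = P(A=a, B=b) / P(B=b)."""
    return joint[(a, b)] / marginal_b(b)


def cond_b_given_a(b, a):
    """P(B=b | A=a) = P(A=a, B=b) / P(A=a)."""
    return joint[(a, b)] / marginal_a(a)


print(f"P(A)   = {marginal_a(True):.3f}")
print(f"P(B)   = {marginal_b(True):.3f}")
print(f"P(A|B) = {cond_a_given_b(True, True):.3f}")
print(f"P(B|A) = {cond_b_given_a(True, True):.3f}")
```

With these numbers the two conditional probabilities come out different, while multiplying $P(A|B)$ by $P(B)$ recovers the joint probability, as the product rule says.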

Specifically, one conditional probability can be calculated using the other conditional probability; for example:
$P(A|B) = \frac{P(B|A) * P(A) }{ P(B)}$

The reverse is also true; for example:
$P(B|A) = \frac{P(A|B) * P(B) }{ P(A)}$
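
One way to see where these two identities come from is to combine the product rule with the symmetry of the joint probability:
$P(A|B) * P(B) = P(A, B) = P(B, A) = P(B|A) * P(A)$

Dividing both sides by $P(B)$ gives the expression for $P(A|B)$, and dividing by $P(A)$ instead gives the expression for $P(B|A)$. Both steps assume the probability we divide by is greater than zero.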

This alternate approach of calculating the conditional probability is useful either when the joint probability is challenging to calculate (which is most of the time), or when the reverse conditional probability is available or easy to calculate.
This alternate calculation of the conditional probability is referred to as Bayes Rule or Bayes Theorem, named for Reverend Thomas Bayes, who is credited with first describing it.

Bayes Theorem: Principled way of calculating a conditional probability without the joint probability.

It is often the case that we do not have access to the denominator directly, e.g. $P(B)$.

We can calculate it an alternative way; for example:
$P(B) = P(B|A) * P(A) + P(B|\text{not } A) * P(\text{not } A)$

This gives a formulation of Bayes Theorem that uses the alternative calculation of $P(B)$, described below:
$P(A|B) = \frac{P(B|A) * P(A)}{P(B|A) * P(A) + P(B|\text{not } A) * P(\text{not } A)}$

Note: the denominator is simply the expansion we gave above.

As such, if we have $P(A)$, then we can calculate $P(\text{not } A)$ as its complement; for example:
$P(\text{not } A) = 1 - P(A)$

Additionally, if we have $P(\text{not } B|\text{not } A)$, then we can calculate $P(B|\text{not } A)$ as its complement; for example:
$P(B|\text{not } A) = 1 - P(\text{not } B|\text{not } A)$
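
The expanded form of the theorem can be wrapped in a small function. This is only a sketch: the function name `bayes_posterior` and the input values are our own assumptions, chosen so the arithmetic is easy to check by hand.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B), expanding P(B) over the two cases A and not A."""
    p_not_a = 1.0 - prior_a  # complement of the prior
    evidence = p_b_given_a * prior_a + p_b_given_not_a * p_not_a  # P(B)
    if evidence == 0.0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return p_b_given_a * prior_a / evidence


# Made-up inputs for illustration only.
p_not_b_given_not_a = 0.8
p_b_given_not_a = 1.0 - p_not_b_given_not_a  # complement, as described above

posterior = bayes_posterior(
    prior_a=0.5, p_b_given_a=0.8, p_b_given_not_a=p_b_given_not_a
)
print(f"P(A|B) = {posterior:.3f}")
```

By hand, the numerator is 0.8 * 0.5 = 0.4 and the denominator is 0.4 + 0.2 * 0.5 = 0.5, so the posterior is 0.8.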

Naming the Terms in the Theorem

The terms in the Bayes Theorem equation are given names depending on the context where the equation is used.

It can be helpful to think about the calculation from these different perspectives, as this can help you map your problem onto the equation.

Firstly, in general, the result $P(A|B)$ is referred to as the posterior probability and $P(A)$ is referred to as the prior probability.
$P(A|B)$: Posterior probability.
$P(A)$: Prior probability.


Sometimes $P(B|A)$ is referred to as the likelihood and $P(B)$ is referred to as the evidence.
$P(B|A)$: Likelihood.
$P(B)$: Evidence.


This allows Bayes Theorem to be restated as:
Posterior = (Likelihood * Prior ) / Evidence

We can make this clear with a smoke and fire case.

What is the probability that there is fire given that there is smoke?
P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)

Where P(Fire) is the prior, P(Smoke|Fire) is the likelihood, and P(Smoke) is the evidence.


Example Diagnostic Test Scenario

An excellent and widely used example of the benefit of Bayes Theorem is in the analysis of a medical diagnostic test.

Scenario: Consider a human population that may or may not have cancer (Cancer is True or False) and a medical test that returns positive or negative for detecting cancer (Test is Positive or Negative), e.g. like a mammogram for detecting breast cancer.

Problem: If a randomly selected patient has the test and it comes back positive, what is the probability that the patient has cancer?

Manual Calculation

Medical diagnostic tests are not perfect; they have error.

Sometimes a patient will have cancer, but the test will not detect it. This capability of the test to detect cancer is referred to as the sensitivity, or the true positive rate.

In this case, we will contrive a sensitivity value for the test. The test is good, but not great, with a true positive rate or sensitivity of 85%. That is, of all the people who have cancer and are tested, 85% of them will get a positive result from the test.

$P(Test=Positive | Cancer=True) = 0.85$

Given this information, our intuition would suggest that there is an 85% probability that the patient has cancer.

Our intuitions of probability are wrong.

This type of error in interpreting probabilities is so common that it has its own name; it is referred to as the base rate fallacy.

It has this name because the error in estimating the probability of an event is caused by ignoring the base rate. That is, it ignores the probability of a randomly selected person having cancer, regardless of the results of a diagnostic test.

In this case, we can assume the probability of breast cancer is low, and use a contrived base rate value of one person in 5,000, or 0.0002 (0.02%).
$P(Cancer=True) = 0.0002$

We can correctly calculate the probability of a patient having cancer given a positive test result using Bayes Theorem.

Let’s map our scenario onto the equation:
$P(A|B) = P(B|A) * P(A) / P(B)$
$P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)$

We know the probability of the test being positive given that the patient has cancer is 85%, and we know the base rate or the prior probability of a given patient having cancer is 0.02%; we can plug these values in:

$P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / P(Test=Positive)$

We do not know $P(Test=Positive)$; it is not given directly.

Instead, we can estimate it using:
$P(B) = P(B|A) * P(A) + P(B|\text{not } A) * P(\text{not } A)$
$P(Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) + P(Test=Positive|Cancer=False) * P(Cancer=False)$

Firstly, we can calculate $P(Cancer=False)$ as the complement of $P(Cancer=True)$, which we already know:
$P(Cancer=False) = 1 - P(Cancer=True)$
$= 1 - 0.0002$
$= 0.9998$

We can plug in our known values as follows:
$P(Test=Positive) = 0.85 * 0.0002 + P(Test=Positive|Cancer=False) * 0.9998$

We still do not know the probability of a positive test result given no cancer.

This requires additional information.

Specifically, we need to know how good the test is at correctly identifying people that do not have cancer. That is, how often it returns a negative result (Test=Negative) when the patient does not have cancer (Cancer=False), called the true negative rate or the specificity.

We will use a contrived specificity value of 95%.
$P(Test=Negative | Cancer=False) = 0.95$

With this final piece of information, we can calculate the false positive or false alarm rate as the complement of the true negative rate.
$P(Test=Positive|Cancer=False) = 1 - P(Test=Negative | Cancer=False)$
$= 1 - 0.95$
$= 0.05$

We can plug this false alarm rate into our calculation of P(Test=Positive) as follows:
$P(Test=Positive) = 0.85 * 0.0002 + 0.05 * 0.9998$
$P(Test=Positive) = 0.00017 + 0.04999$
$P(Test=Positive) = 0.05016$

Excellent, so the probability of the test returning a positive result, regardless of whether the person has cancer or not is about 5%.

We now have enough information to calculate Bayes Theorem and estimate the probability of a randomly selected person having cancer if they get a positive test result.
$P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)$
$P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / 0.05016$
$P(Cancer=True | Test=Positive) = 0.00017 / 0.05016$
$P(Cancer=True | Test=Positive) = 0.003389154704944$

The calculation suggests that if the patient receives a positive result from this test, there is only about a 0.34% chance that they have cancer.

It is a terrible diagnostic test!

The example also shows that the calculation of the conditional probability requires enough information.

For example, if we have the values used in Bayes Theorem already, we can use them directly.

This is rarely the case, and we typically have to calculate the bits we need and plug them in, as we did in this case. In our scenario we were given three pieces of information: the sensitivity (or true positive rate), the base rate, and the specificity (or true negative rate).
Sensitivity: 85% of people with cancer will get a positive test result.
Base Rate: 0.02% of people have cancer.
Specificity: 95% of people without cancer will get a negative test result.

We did not have the P(Test=Positive), but we calculated it given what we already had available.
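
Another way to check the result is to count people instead of multiplying probabilities. The sketch below uses the same three inputs as the scenario; the cohort size of one million is an arbitrary assumption of ours, and the variable names are only illustrative.

```python
# Inputs from the scenario above.
base_rate = 0.0002    # P(Cancer=True)
sensitivity = 0.85    # P(Test=Positive | Cancer=True)
specificity = 0.95    # P(Test=Negative | Cancer=False)

# Assumption: an imaginary cohort, large enough to give whole-looking counts.
population = 1_000_000

with_cancer = population * base_rate                   # about 200 people
without_cancer = population - with_cancer              # about 999,800 people
true_positives = with_cancer * sensitivity             # about 170 people
false_positives = without_cancer * (1 - specificity)   # about 49,990 people

# Among everyone who tests positive, the fraction who really have cancer.
posterior = true_positives / (true_positives + false_positives)
print(f"P(Cancer=True | Test=Positive) = {posterior:.6f}")  # prints 0.003389
```

Seen this way, the roughly 170 true positives are swamped by roughly 49,990 false positives, which is why the posterior is so small even though the sensitivity is 85%.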

We might imagine that Bayes Theorem allows us to be even more precise about a given scenario. For example, if we had more information about the patient (e.g. their age) and about the domain (e.g. cancer rates for age ranges), we could in turn offer an even more accurate probability estimate.
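
As a rough sketch of how extra information could change the answer, we can recompute the posterior for a few different base rates while keeping the test's sensitivity and specificity fixed. Only the first base rate comes from the scenario; the other two groups and their rates are hypothetical values invented for illustration, not real epidemiological figures.

```python
def posterior_given_positive(base_rate, sensitivity=0.85, specificity=0.95):
    """P(Cancer=True | Test=Positive) for a given base rate (prior)."""
    false_positive_rate = 1.0 - specificity
    evidence = sensitivity * base_rate + false_positive_rate * (1.0 - base_rate)
    return sensitivity * base_rate / evidence


# Assumption: hypothetical base rates. Only the first one is from the scenario.
hypothetical_base_rates = {
    "scenario base rate": 0.0002,
    "hypothetical group A": 0.002,
    "hypothetical group B": 0.02,
}

for group, rate in hypothetical_base_rates.items():
    print(f"{group}: prior={rate:.4f}, posterior={posterior_given_positive(rate):.4f}")
```

The same positive result carries more weight when the prior is higher, which is the sense in which better information about the patient leads to a better estimate.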


Bayes Theorem for Modeling Hypotheses

Bayes Theorem is a useful tool in applied machine learning.

It provides a way of thinking about the relationship between data and a model.

A machine learning algorithm or model is a specific way of thinking about the structured relationships in the data. In this way, a model can be thought of as a hypothesis about the relationships in the data, such as the relationship between input (X) and output (y). The practice of applied machine learning is the testing and analysis of different hypotheses (models) on a given dataset.

Bayes Theorem provides a probabilistic model to describe the relationship between data (D) and a hypothesis (h); for example:

$P(h|D) = P(D|h) * P(h) / P(D)$

Breaking this down, it says that the probability of a given hypothesis holding or being true given some observed data can be calculated as the probability of observing the data given the hypothesis multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.

Under this framework, each piece of the calculation has a specific name; for example:
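P(h|D): Posterior probability of the hypothesis (the thing we want to calculate).
P(h): Prior probability of the hypothesis.
P(D|h): Likelihood of the data given the hypothesis.
P(D): Evidence, the probability of the data regardless of the hypothesis.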

This gives a useful framework for thinking about and modeling a machine learning problem.

If we have some prior domain knowledge about the hypothesis, this is captured in the prior probability. If we don’t, then all hypotheses may have the same prior probability.

All else being equal, if the probability of observing the data $P(D)$ increases, then the probability of the hypothesis holding given the data $P(h|D)$ decreases. Conversely, if the probability of the hypothesis $P(h)$ or the probability of observing the data given the hypothesis $P(D|h)$ increases, the probability of the hypothesis holding given the data $P(h|D)$ increases.

The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, … in H) being true given the observed data.

Seeking the hypothesis with the maximum posterior probability is called maximum a posteriori estimation, or MAP for short.

Any such maximally probable hypothesis is called a Maximum A Posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

Under this framework, the probability of the data (D) is constant as it is used in the assessment of each hypothesis. Therefore, it can be removed from the calculation to give the simplified unnormalized estimate as follows:

$\arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h) * P(h)$

If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform probability, and this term too will be a constant and can be removed from the calculation to give the following:

$\arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)$

That is, the goal is to locate a hypothesis that best explains the observed data.
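
A small sketch can show the difference between the two selection rules. Everything here is an assumption made for illustration: three candidate hypotheses about a coin's probability of heads, a prior that strongly favours a fair coin, and an observed dataset of 7 heads and 3 tails.

```python
# Hypothetical hypothesis space: candidate values for P(heads).
hypotheses = [0.3, 0.5, 0.7]
# Assumed prior P(h): we believe the coin is probably fair.
prior = dict(zip(hypotheses, [0.1, 0.8, 0.1]))
# Assumed observed data D.
heads, tails = 7, 3


def likelihood(h):
    """P(D|h) up to the binomial coefficient, which is the same for every h."""
    return h ** heads * (1 - h) ** tails


# With a uniform prior the prior term drops out: pick the best-explaining h.
h_ml = max(hypotheses, key=likelihood)
# MAP: maximise the unnormalized posterior P(D|h) * P(h); P(D) is a constant.
h_map = max(hypotheses, key=lambda h: likelihood(h) * prior[h])

print("Maximum likelihood hypothesis:", h_ml)   # prints 0.7
print("MAP hypothesis:", h_map)                 # prints 0.5
```

The data alone favour the hypothesis 0.7, but the strong prior on a fair coin is enough to make 0.5 the MAP hypothesis. With more data the likelihood would eventually outweigh the prior.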

Fitting models like linear regression for predicting a numerical value, and logistic regression for binary classification can be framed and solved under the MAP probabilistic framework. This provides an alternative to the more common Maximum Likelihood Estimation (MLE) framework.

Bayes Theorem for Classification

Classification is a predictive modeling problem that involves assigning a label to a given input data sample.

The problem of classification predictive modeling can be framed as calculating the conditional probability of a class label given a data sample, for example:

$P(class|data) = (P(data|class) * P(class)) / P(data)$

Where $P(class|data)$ is the probability of class given the provided data.

This calculation can be performed for each class in the problem and the class that is assigned the largest probability can be selected and assigned to the input data.

In practice, it is very challenging to calculate full Bayes Theorem for classification.

The priors for the class and the data are easy to estimate from a training dataset, if the dataset is suitably representative of the broader problem.

Estimating the conditional probability of the observation given the class, $P(data|class)$, is not feasible unless the number of examples is extraordinarily large, e.g. large enough to effectively estimate the probability distribution for all different possible combinations of values. This is almost never the case, so we will not have sufficient coverage of the domain.

As such, the direct application of Bayes Theorem also becomes intractable, especially as the number of variables or features (n) increases.

Naive Bayes Classifier

The solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.

Bayes Theorem itself makes no independence assumption, so the full calculation of $P(data|class)$ must allow each input variable to depend upon all other variables. This is a cause of complexity in the calculation. We can simplify by assuming instead that the input variables are independent of each other given the class.

This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.

This means that we calculate $P(data|class)$ for each input variable separately and multiply the results together, for example:
$P(class | X_1, X_2,\ldots, X_n) = P(X_1|class) \times P(X_2|class) \times \ldots  \times P(X_n|class) \times P(class) / P(data)$

We can also drop the probability of observing the data as it is a constant for all calculations, which leaves a score that is proportional to the posterior, for example:
$P(class | X_1, X_2, \ldots, X_n) \propto P(X_1|class) \times P(X_2|class) \times \ldots \times P(X_n|class) \times P(class)$

This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.
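
The calculation can be sketched from scratch in a few lines of Python. The tiny training set, the feature names and the helper functions below are all hypothetical, and the probabilities are estimated from raw counts with no smoothing, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

# Hypothetical training set: ((outlook, windy), play).
train = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
    (("rainy", "yes"), "no"),
    (("sunny", "yes"), "yes"),
    (("rainy", "no"), "no"),
]

# class_counts[label] = number of training rows with that class label.
class_counts = Counter(label for _, label in train)
# feature_counts[label][i][value] = rows of that class whose i-th feature is value.
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1


def unnormalized_posterior(features, label):
    """P(class) times the product of P(X_i | class), estimated from counts."""
    score = class_counts[label] / len(train)  # P(class)
    for i, value in enumerate(features):
        score *= feature_counts[label][i][value] / class_counts[label]
    return score


def predict(features):
    """Return the class with the largest unnormalized posterior."""
    return max(class_counts, key=lambda label: unnormalized_posterior(features, label))


print(predict(("sunny", "no")))   # prints yes
print(predict(("rainy", "yes")))  # prints no
```

Because $P(data)$ is dropped, the scores do not sum to one, but the ranking of the classes is unchanged, which is all the prediction needs.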

Bayes Optimal Classifier

The Bayes optimal classifier is a probabilistic model that makes the most likely prediction for a new example, given the training dataset.

This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function.

Bayes Classifier: Probabilistic model that makes the most probable prediction for new examples.

Specifically, the Bayes optimal classifier answers the question:
What is the most probable classification of the new instance given the training data?

This is different from the MAP framework that seeks the most probable hypothesis (model). Instead, we are interested in making a specific prediction.

The equation below demonstrates how to calculate the conditional probability of a candidate classification $(v_j)$ for a new instance given the training data $(D)$ and a space of hypotheses $(H)$.
$P(v_j | D) = \sum_{h_i \in H} P(v_j | h_i) * P(h_i | D)$

Where $v_j$ is a candidate classification for the new instance, $H$ is the set of hypotheses for classifying the instance, $h_i$ is a given hypothesis, $P(v_j | h_i)$ is the probability of $v_j$ given hypothesis $h_i$, and $P(h_i | D)$ is the posterior probability of the hypothesis $h_i$ given the data $D$.

Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.
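
The sketch below works through the sum for a made-up case, treating each $v_j$ as a candidate label for the new instance. The three hypotheses, their posteriors and their predictions are illustrative assumptions; each hypothesis is taken to predict one label with certainty.

```python
# Assumed posteriors P(h_i | D) for three hypotheses (illustrative numbers).
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Assumed P(v_j | h_i): each hypothesis predicts a single label with certainty.
p_label_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}
labels = ["+", "-"]

# P(v_j | D) = sum over h_i of P(v_j | h_i) * P(h_i | D)
p_label_given_data = {
    v: sum(p_label_given_h[h][v] * posterior_h[h] for h in posterior_h)
    for v in labels
}

# Bayes optimal prediction: the label with the largest P(v_j | D).
bayes_optimal_label = max(labels, key=p_label_given_data.get)

# For comparison: the label predicted by the single MAP hypothesis.
map_hypothesis = max(posterior_h, key=posterior_h.get)
map_label = max(labels, key=p_label_given_h[map_hypothesis].get)

print("Bayes optimal label:", bayes_optimal_label)  # prints -
print("MAP hypothesis:", map_hypothesis, "predicts", map_label)  # h1 predicts +
```

Here the single most probable hypothesis predicts "+", but the hypotheses that predict "-" together carry more posterior weight, so the Bayes optimal prediction is "-".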

Any model that classifies examples using this equation is a Bayes optimal classifier, and no other model using the same hypothesis space and prior knowledge can outperform this technique, on average.

We have to let that sink in. It is a big deal.

Because the Bayes classifier is optimal, the Bayes error is the minimum possible error that can be made.
Bayes Error: The minimum possible error that can be made when making predictions.

It is a theoretical model, but it is held up as an ideal that we may wish to pursue.

The Naive Bayes classifier is an example of a classifier that adds some simplifying assumptions and attempts to approximate the Bayes Optimal Classifier.
