Naive Bayes Classifier
Classification is one of the most common forms of prediction, where the goal is to predict the class of a record. In binary classification we predict whether a record is a 1 or a 0, such as spam/not spam or churn/not churn; in multiclass classification we predict one of several classes, such as classifying a mail as primary/social/promotional.
In addition to predicting the class of a record, we often also want the predicted probability of belonging to the class of interest. These probability values are also called propensity scores. We can set a cutoff probability for the class of interest: records with a propensity score above the cutoff are assigned to that class, and the rest are not.
Classification is a form of supervised learning: we start with labeled data where the class of each record is known, train our model on this data, and then apply the model to new data where the class is unknown.
In this post, we will see how to use the Naive Bayes algorithm for classification problems by implementing it in Python.
Naive Bayes Classifier
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training dataset.
Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to a class.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
The name Naïve Bayes comprises two words, Naïve and Bayes, which can be described as follows:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem, which is described below.
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
Bayes' Theorem
In machine learning we are often interested in selecting the best hypothesis (h) given data (d).
In a classification problem, our hypothesis (h) may be the class to assign for a new data instance (d).
One of the easiest ways of selecting the most probable hypothesis is to use the data we already have as our prior knowledge about the problem. Bayes' Theorem provides a way to calculate the probability of a hypothesis given our prior knowledge.
Bayes’ Theorem is stated as:
$P(h|d) = (P(d|h) * P(h)) / P(d)$
Where
$P(h|d)$ is the probability of hypothesis $h$ given the data $d$. This is called the posterior probability.
$P(d|h)$ is the probability of data $d$ given that hypothesis $h$ is true. This is called the likelihood.
$P(h)$ is the probability of hypothesis $h$ being true (regardless of the data). This is called the prior probability of $h$.
$P(d)$ is the probability of the data (regardless of the hypothesis). This is called the prior probability of the predictor, or the evidence.
You can see that we are interested in calculating the posterior probability $P(h|d)$ from the prior probability $P(h)$ together with $P(d)$ and $P(d|h)$.
After calculating the posterior probability for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the maximum probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.
This can be written as:
$MAP(h) = max(P(h|d))$
or
$MAP(h) = max((P(d|h) * P(h)) / P(d))$
or
$MAP(h) = max(P(d|h) * P(h))$
$P(d)$ is a normalizing term that makes the posterior a proper probability. We can drop it when we are interested only in the most probable hypothesis, because it is constant across hypotheses and is only used to normalize.
Back to classification: if we have an equal number of instances in each class in our training data, then the probability of each class (e.g. $P(h)$) will be equal. Again, this would be a constant term in our equation and we could drop it, so that we end up with:
$MAP(h) = max(P(d|h))$
It is called naive Bayes (or idiot Bayes) because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint conditional probability of all attribute values, $P(d1, d2, d3|h)$, the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as $P(d1|h) * P(d2|h) * P(d3|h)$ and so on.
This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.
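Combining the MAP rule above with the conditional independence assumption, the naive Bayes decision rule for a record with attribute values $d1, d2, \dots, dn$ can be written as:
$MAP(h) = max(P(d1|h) * P(d2|h) * \dots * P(dn|h) * P(h))$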
Consider the spam mail detection problem: if the word 'money' is present in a mail, what is the probability that the mail is spam?
$P(spam|money) = \frac{P(money|spam) \cdot P(spam)}{P(money)}$, which is the posterior probability.
Likelihood is the probability of the evidence given that the event is true, i.e. $P(money|spam)$ is the probability that a mail includes the word “money” given that the mail is spam.
Prior probability is the probability of an event before new data is collected, i.e. $P(spam)$ is the probability that a mail is spam before the new mail is seen.
Marginal likelihood, also called evidence, is the probability of the evidence occurring, i.e. $P(money)$ is the probability that a mail includes the word “money” in its text.
Maximum a posteriori (MAP) is the hypothesis with the highest posterior probability. After calculating the posterior probability for several hypotheses we select the hypothesis with the highest probability.
Example: If $P(spam|money) > P(not spam|money)$ then we can say that the mail can be classified as spam. This is the maximum probable hypothesis.
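As a small illustration, the spam calculation above can be reproduced in Python. The probability values below are made-up placeholders for illustration, not estimates from any real mail corpus:
# Hypothetical values chosen only to illustrate Bayes' theorem
p_spam = 0.3                     # P(spam): prior probability of spam
p_money_given_spam = 0.6         # P(money|spam): likelihood
p_money_given_ham = 0.05         # P(money|not spam)
# Marginal likelihood P(money) via the law of total probability
p_money = p_money_given_spam * p_spam + p_money_given_ham * (1 - p_spam)
# Posterior P(spam|money) by Bayes' theorem
p_spam_given_money = p_money_given_spam * p_spam / p_money
print(p_spam_given_money)        # compare with P(not spam|money) = 1 - p_spam_given_money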
The representation for naive Bayes is a set of probabilities. A learned naive Bayes model stores a list of probabilities, which includes:
Class Probabilities: The probabilities of each class in the training dataset.
Conditional Probabilities: The conditional probabilities of each input value given each class value.
Calculating Class Probabilities
The class probabilities are simply the frequency of instances that belong to each class divided by the total number of instances.
$P(class=1) = count(class=1) / (count(class=0) + count(class=1))$
In the simplest case each class would have the probability of 0.5 or 50% for a binary classification problem with the same number of instances in each class.
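For example, with a small list of made-up binary labels the class probabilities could be computed like this:
# Hypothetical binary class labels, for illustration only
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
p_class_1 = labels.count(1) / len(labels)   # count(class=1) / total count
p_class_0 = labels.count(0) / len(labels)
print(p_class_1, p_class_0)                 # 0.6 0.4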
Calculating Conditional Probabilities
The conditional probabilities are the frequency of each attribute value for a given class value divided by the frequency of instances with that class value.
For example, if a “weather” attribute had the values “sunny” and “rainy” and the class attribute had the class values “go out” and “stay home“, then the conditional probabilities of each weather value for each class value could be calculated as:
$P(\text{weather=sunny}|\text{class=go out}) = \frac{\text{count(instances with weather=sunny and class=go out)}}{\text{count(instances with class=go out)}}$
$P(\text{weather=sunny}|\text{class=stay home}) = \frac{\text{count(instances with weather=sunny and class=stay home)}}{\text{count(instances with class=stay home)}}$
$P(\text{weather=rainy}|\text{class=go out}) = \frac{\text{count(instances with weather=rainy and class=go out)}}{\text{count(instances with class=go out)}}$
$P(\text{weather=rainy}|\text{class=stay home}) = \frac{\text{count(instances with weather=rainy and class=stay home)}}{\text{count(instances with class=stay home)}}$
Given a naive Bayes model, you can make predictions for new data using Bayes theorem.
$MAP(h) = max(P(d|h) * P(h))$
Using our example above, if we had a new instance with the weather of sunny, we can calculate:
$\text{go out score} = P(\text{weather=sunny}|\text{class=go out}) * P(\text{class=go out})$
$\text{stay home score} = P(\text{weather=sunny}|\text{class=stay home}) * P(\text{class=stay home})$
We can choose the class that has the largest calculated value. We can turn these values into probabilities by normalizing them as follows:
$P(\text{go out}|\text{weather=sunny}) = \frac{\text{go out score}}{\text{go out score} + \text{stay home score}}$
$P(\text{stay home}|\text{weather=sunny}) = \frac{\text{stay home score}}{\text{go out score} + \text{stay home score}}$
If we had more input variables we could extend the above example. For example, pretend we have a “car” attribute with the values “working” and “broken“. We can multiply this probability into the equation.
For example below is the calculation for the “go out” class label with the addition of the car input variable set to “working”:
$\text{go out score} = P(\text{weather=sunny}|\text{class=go out}) * P(\text{car=working}|\text{class=go out}) * P(\text{class=go out})$
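The whole weather/car calculation can be sketched in a few lines of Python. The training rows below are invented purely for illustration; the structure of the computation is what matters:
# Made-up training data: (weather, car, class), for illustration only
from collections import Counter

data = [
    ('sunny', 'working', 'go out'), ('sunny', 'working', 'go out'),
    ('sunny', 'broken', 'go out'), ('sunny', 'working', 'go out'),
    ('rainy', 'working', 'stay home'), ('sunny', 'broken', 'stay home'),
    ('rainy', 'broken', 'stay home'), ('rainy', 'working', 'stay home'),
]

class_counts = Counter(label for _, _, label in data)   # class frequencies
total = sum(class_counts.values())

def conditional(attr_index, value, label):
    # P(attribute=value | class=label), estimated from frequency counts
    in_class = [row for row in data if row[2] == label]
    return sum(row[attr_index] == value for row in in_class) / len(in_class)

# Unnormalized score for each class for a new instance: weather=sunny, car=working
scores = {}
for label, count in class_counts.items():
    prior = count / total
    scores[label] = conditional(0, 'sunny', label) * conditional(1, 'working', label) * prior

# Normalize the scores so they become probabilities
norm = sum(scores.values())
print({label: score / norm for label, score in scores.items()})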
Example-1
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
First, convert the given dataset into a frequency table, then generate a likelihood table by finding the probabilities of the given features.
Finally, use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
No. | Weather | Play |
1 | Rainy | Yes |
2 | Sunny | Yes |
3 | Overcast | Yes |
4 | Overcast | Yes |
5 | Sunny | No |
6 | Rainy | Yes |
7 | Sunny | Yes |
8 | Overcast | Yes |
9 | Rainy | No |
10 | Sunny | No |
11 | Sunny | Yes |
12 | Rainy | No |
13 | Overcast | Yes |
14 | Overcast | Yes |
Frequency table for the Weather Conditions:
Weather | No | Yes |
Overcast | 0 | 5 |
Rainy | 2 | 2 |
Sunny | 2 | 3 |
Total | 4 | 10 |
Likelihood table for the weather conditions:
Weather | No | Yes | P(Weather) |
Overcast | 0 | 5 | 5/14 = 0.35 |
Rainy | 2 | 2 | 4/14 = 0.29 |
Sunny | 2 | 3 | 5/14 = 0.35 |
All | 4/14 = 0.29 | 10/14 = 0.71 | |
Applying Bayes' theorem:
$P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)$
$P(Sunny|Yes) = 3/10 = 0.3$
$P(Sunny)= 0.35$
$P(Yes)=0.71$
So $P(Yes|Sunny) = 0.3*0.71/0.35= 0.60$
$P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)$
$P(Sunny|No) = 2/4 = 0.5$
$P(No)= 0.29$
$P(Sunny)= 0.35$
So $P(No|Sunny)= 0.5*0.29/0.35 = 0.41$
From the above calculation we can see that $P(Yes|Sunny) > P(No|Sunny)$.
Hence, on a sunny day, the player can play the game.
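The same numbers can be reproduced with a short Python sketch built directly from the 14 rows of the table above:
# The 14 weather/play records from the table above
weather = ['Rainy', 'Sunny', 'Overcast', 'Overcast', 'Sunny', 'Rainy', 'Sunny',
           'Overcast', 'Rainy', 'Sunny', 'Sunny', 'Rainy', 'Overcast', 'Overcast']
play = ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes',
        'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes']

n = len(play)
p_yes = play.count('Yes') / n          # 10/14 ~ 0.71
p_no = play.count('No') / n            # 4/14  ~ 0.29
p_sunny = weather.count('Sunny') / n   # 5/14  ~ 0.35

# Likelihoods P(Sunny|Yes) and P(Sunny|No)
p_sunny_given_yes = sum(w == 'Sunny' and p == 'Yes' for w, p in zip(weather, play)) / play.count('Yes')  # 3/10
p_sunny_given_no = sum(w == 'Sunny' and p == 'No' for w, p in zip(weather, play)) / play.count('No')     # 2/4

# Posteriors by Bayes' theorem
print(p_sunny_given_yes * p_yes / p_sunny)   # P(Yes|Sunny) = 0.6
print(p_sunny_given_no * p_no / p_sunny)     # P(No|Sunny)  = 0.4 (the 0.41 above comes from rounded inputs)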
Example-2
This example dataset contains examples of the different conditions that are associated with accidents. The target variable accident is a binary categorical variable with yes/no values. There are 4 categorical features: weather condition, road condition, traffic condition, and engine problem.
Prior probability computation:
There are 10 data points ($m = 10$): 5 instances of class/label 'yes' ($N_{Accident_{yes}} = 5$) and 5 instances of class/label 'no' ($N_{Accident_{no}} = 5$). The prior probabilities can be computed using the equation for the prior probability:
$𝑃(Accident_{yes}) = 5/10$
$P(Accident_{no}) = 5/10$
Class conditional probability computation:
The dataset is split based on the target labels (yes/no) first. Since there are 2 classes for the target variable we get 2 sub-tables. If the target variable had 3 classes we would get 3 sub-tables, one for each of the classes.
The class conditional probabilities can be computed using Table-4 and Table-5.
Predicting posterior probability:
Suppose we are now given a new feature vector:
Weather condition: rain
Road condition: good
Traffic condition: normal
Engine problem: no
The task is to predict whether an accident will happen.
The posterior probability for each of the target classes is computed using the equation for the posterior probability, and the final class probabilities are then obtained by normalizing these posterior values.
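Since Table-4 and Table-5 are not reproduced here, the sketch below uses a small made-up accident dataset with only two of the four features, just to show how the prior, class-conditional, and normalized posterior probabilities could be computed with pandas:
# HYPOTHETICAL accident data (the original Table-4/Table-5 are not reproduced here)
import pandas as pd

data = pd.DataFrame({
    'weather':  ['rain', 'sunny', 'rain', 'sunny', 'rain', 'sunny', 'rain', 'sunny', 'rain', 'sunny'],
    'road':     ['good', 'bad',   'bad',  'good',  'good', 'bad',   'bad',  'good',  'good', 'bad'],
    'accident': ['yes',  'no',    'yes',  'no',    'no',   'yes',   'yes',  'no',    'no',   'yes'],
})

# Prior probabilities P(accident=yes) and P(accident=no)
priors = data['accident'].value_counts(normalize=True)

# Class-conditional probabilities, e.g. P(weather=rain | accident=yes)
cond_weather = pd.crosstab(data['weather'], data['accident'], normalize='columns')
cond_road = pd.crosstab(data['road'], data['accident'], normalize='columns')

# Unnormalized posterior scores for a new instance (weather=rain, road=good)
score = {c: priors[c] * cond_weather.loc['rain', c] * cond_road.loc['good', c]
         for c in ['yes', 'no']}
total = sum(score.values())
posterior = {c: s / total for c, s in score.items()}   # normalize so they sum to 1
print(posterior)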
Advantages of Naïve Bayes Classifier:
It can be used for binary as well as multi-class classification.
It performs well in multi-class prediction compared to other algorithms.
It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
It is used for credit scoring.
It is used in medical data classification.
Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
Multi-class prediction: the algorithm is well known for multi-class prediction; we can predict the probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text classification (due to good results in multi-class problems and the independence assumption) and have a higher success rate than many other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
Recommendation systems: a Naive Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
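In scikit-learn these three variants are available as GaussianNB, MultinomialNB, and BernoulliNB; a minimal sketch on tiny made-up data could look like this:
# Minimal sketch of the three scikit-learn Naive Bayes variants on toy data
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])

# GaussianNB: continuous features assumed normally distributed within each class
X_cont = np.array([[1.2, 3.4], [5.1, 0.2], [0.9, 2.8], [4.7, 0.5]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# MultinomialNB: count features, e.g. word frequencies in documents
X_counts = np.array([[2, 0, 1], [0, 3, 1], [3, 0, 0], [0, 2, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# BernoulliNB: binary features, e.g. whether a word is present or absent
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))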
Example with Python code
Download the titanic.csv file from Kaggle before running this program.
#import the libraries and load the titanic.csv file
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
df=pd.read_csv('titanic.csv')
# data preprocessing: drop unwanted columns and set up the input and target fields
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis=1,inplace=True)
target=df.Survived
inputs=df.drop(['Survived'],axis=1)
print(inputs.head(3))
print(target.head(3))
   Pclass     Sex   Age     Fare
0       3    male  22.0   7.2500
1       1  female  38.0  71.2833
2       3  female  26.0   7.9250
0    0
1    1
2    1
# do one-hot encoding on the Sex column and drop it
dummies=pd.get_dummies(inputs.Sex)
dummies.head(3)
   female  male
0       0     1
1       1     0
2       1     0
inputs=pd.concat([inputs,dummies],axis=1)
inputs.head(3)
inputs.drop(['Sex'],axis=1,inplace=True)
# check which columns contain missing values and fill missing Age values with the mean age
inputs.columns[inputs.isna().any()]
inputs.Age=inputs.Age.fillna(inputs.Age.mean())
#do a train test data split 80:20
X_train,X_test,y_train,y_test=train_test_split(inputs,target,test_size=0.2)
print(len(inputs))
print(len(X_train))
print(len(X_test))
# train the Gaussian Naive Bayes model and compute the accuracy score on the test set
from sklearn.naive_bayes import GaussianNB
model=GaussianNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)
0.7374301675977654
# compare the predictions with the actual values for the first 10 test records
print(y_test[:10])
210    0
321    0
554    1
732    0
741    0
356    1
44     1
349    0
272    1
341    1
model.predict(X_test[:10])
array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=int64)
model.predict_proba(X_test[:10])
array([[9.99948794e-01, 5.12059773e-05],
       [9.99954949e-01, 4.50513155e-05],
       [1.54513682e-06, 9.99998455e-01],
       [5.52541073e-09, 9.99999994e-01],
       [9.99618589e-01, 3.81410872e-04],
       [1.65503493e-06, 9.99998345e-01],
       [6.05856963e-07, 9.99999394e-01],
       [6.82368619e-07, 9.99999318e-01],
       [1.95886497e-07, 9.99999804e-01],
       [1.64566569e-06, 9.99998354e-01]])
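The classification_report and confusion_matrix functions imported at the top are not used above; they could be applied to the test split like this to get per-class precision and recall in addition to the accuracy score:
# evaluate the fitted model on the full test split
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))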
Refer to the included textbook chapter to learn more.