Naive Bayes Classifier

 

Classification is one of the most common forms of prediction, where the goal is to predict the class of a record. In binary classification we aim to predict whether a record is a 1 or a 0, such as spam/not spam or churn/not churn; in multiclass classification we aim to predict one of several classes, such as labelling a mail as primary/social/promotional, etc.

In addition to predicting the class of a record, most of the time we also want to know the predicted probability of it belonging to the class of interest. These probability values are also called propensity scores. We can set a cutoff probability for the class of interest: records with a propensity score above the cutoff are assigned to that class, and records below it are not.
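For example, a minimal sketch (with made-up propensity scores and an assumed cutoff of 0.5) of how a cutoff turns scores into class labels:

import numpy as np

# hypothetical propensity scores for the class of interest (e.g. spam)
propensity = np.array([0.91, 0.15, 0.62, 0.48])

cutoff = 0.5  # assumed cutoff; in practice it is tuned to the problem
predicted_class = (propensity >= cutoff).astype(int)
print(predicted_class)  # [1 0 1 0]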

Classification is a form of supervised learning: we start with labeled data where the class of each record is known, train a model on this data, and then apply the model to new data where the class is unknown.

In this post, we will see how to use the Naive Bayes algorithm for classification problems and implement it in Python.

Naive Bayes Classifier
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
It is mainly used in text classification, which typically involves a high-dimensional training dataset.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The name Naïve Bayes comprises two words, Naïve and Bayes, which can be described as:

Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.

Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes’ Theorem

In machine learning we are often interested in selecting the best hypothesis (h) given data (d).

In a classification problem, our hypothesis (h) may be the class to assign for a new data instance (d).

One of the easiest ways of selecting the most probable hypothesis is to use the data we have as our prior knowledge about the problem. Bayes' theorem provides a way to calculate the probability of a hypothesis given our prior knowledge.

Bayes’ Theorem is stated as:

$P(h|d) = (P(d|h) * P(h)) / P(d)$

Where
$P(h|d)$ is the probability of hypothesis $h$ given the data $d$. This is called the posterior probability.
$P(d|h)$ is the probability of data $d$ given that hypothesis $h$ was true. This is called the likelihood.
$P(h)$ is the probability of hypothesis $h$ being true (regardless of the data). This is called the prior probability of $h$.
$P(d)$ is the probability of the data (regardless of the hypothesis). This is called the evidence, or the prior probability of the predictor.

You can see that we are interested in calculating the posterior probability $P(h|d)$ from the prior probability $P(h)$ together with $P(d)$ and $P(d|h)$.

After calculating the posterior probability for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the maximum probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.

This can be written as:

$MAP(h) = max(P(h|d))$

or

$MAP(h) = max((P(d|h) * P(h)) / P(d))$

or

$MAP(h) = max(P(d|h) * P(h))$

The term $P(d)$ is a normalizing term which ensures the posterior probabilities sum to one. We can drop it when we are only interested in the most probable hypothesis, as it is constant across hypotheses and is only used to normalize.

Back to classification: if we have an equal number of instances in each class in our training data, then the probability of each class (e.g. $P(h)$) will be equal. Again, this would be a constant term in our equation and we could drop it, so that we end up with:

$MAP(h) = max(P(d|h))$

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes (or idiot Bayes) because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, $P(d_1, d_2, d_3|h)$, the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as $P(d_1|h) * P(d_2|h)$ and so on.

This is a very strong assumption that rarely holds in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.



Example
Consider the spam mail detection problem. If the word 'money' is present in the mail, what is the probability that it is spam?
$P(spam|money)=\frac{P(money|spam) \cdot P(spam)}{P(money)}$, which is the posterior probability.

Likelihood is the probability of the evidence given that the event is true, i.e. $P(money|spam)$ is the probability that a mail includes the word "money" given that the mail is spam.

Prior probability is the probability of an event before new data is collected, i.e. $P(spam)$ is the probability that a mail is spam before any new mail is seen.

Marginal likelihood, also called evidence, is the probability of the evidence event occurring, i.e. $P(money)$ is the probability that a mail includes the word "money" in its text.

Maximum a posteriori (MAP) is the hypothesis with the highest posterior probability. After calculating the posterior probability for several hypotheses we select the hypothesis with the highest probability.

Example: If $P(spam|money) > P(not\ spam|money)$ then the mail can be classified as spam. This is the maximum probable hypothesis.
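As a rough numeric sketch (the counts below are made up for illustration), the spam example can be computed directly from Bayes' theorem:

# hypothetical counts: 20 of 100 mails are spam; 'money' appears in 12 spam and 4 non-spam mails
n_spam, n_not_spam = 20, 80
money_in_spam, money_in_not_spam = 12, 4

p_spam = n_spam / (n_spam + n_not_spam)                   # prior P(spam) = 0.2
p_not_spam = 1 - p_spam                                   # prior P(not spam) = 0.8
p_money_given_spam = money_in_spam / n_spam               # likelihood P(money|spam) = 0.6
p_money_given_not_spam = money_in_not_spam / n_not_spam   # likelihood P(money|not spam) = 0.05

# evidence P(money) by the law of total probability
p_money = p_money_given_spam * p_spam + p_money_given_not_spam * p_not_spam   # 0.16

p_spam_given_money = p_money_given_spam * p_spam / p_money              # 0.75
p_not_spam_given_money = p_money_given_not_spam * p_not_spam / p_money  # 0.25
# P(spam|money) > P(not spam|money), so spam is the MAP hypothesis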


Representation Used By Naive Bayes Models

The representation for naive Bayes is probabilities.

A list of probabilities is stored to file for a learned naive Bayes model. This includes:
Class Probabilities: The probabilities of each class in the training dataset.
Conditional Probabilities: The conditional probabilities of each input value given each class value.


Calculating Class Probabilities

The class probabilities are simply the frequency of instances that belong to each class divided by the total number of instances.

For example, in binary classification the probability of an instance belonging to class 1 would be calculated as:

$P(class=1) = count(class=1) / (count(class=0) + count(class=1))$

In the simplest case each class would have the probability of 0.5 or 50% for a binary classification problem with the same number of instances in each class.
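A one-line sketch of this calculation (the class counts are made up):

# hypothetical class counts in a binary training set
count_class_0, count_class_1 = 40, 60
p_class_1 = count_class_1 / (count_class_0 + count_class_1)   # 0.6
p_class_0 = count_class_0 / (count_class_0 + count_class_1)   # 0.4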

Calculating Conditional Probabilities (based on the example given below)

The conditional probabilities are the frequency of each attribute value for a given class value divided by the frequency of instances with that class value.

For example, if a “weather” attribute had the values “sunny” and “rainy” and the class attribute had the class values “go out” and “stay home”, then the conditional probabilities of each weather value for each class value could be calculated as:

$P(weather=sunny|class=go\ out) = \frac{\text{count(instances with weather=sunny and class=go out)}}{\text{count(instances with class=go out)}}$

$P(weather=sunny|class=stay\ home) = \frac{\text{count(instances with weather=sunny and class=stay home)}}{\text{count(instances with class=stay home)}}$

$P(weather=rainy|class=go\ out) = \frac{\text{count(instances with weather=rainy and class=go out)}}{\text{count(instances with class=go out)}}$

$P(weather=rainy|class=stay\ home) = \frac{\text{count(instances with weather=rainy and class=stay home)}}{\text{count(instances with class=stay home)}}$
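A small pandas sketch of these counts-to-probabilities calculations (the six rows below are made up to match the example):

import pandas as pd

# hypothetical training data with a weather attribute and a class attribute
df = pd.DataFrame({
    "weather": ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"],
    "class":   ["go out", "go out", "stay home", "stay home", "stay home", "go out"],
})

# frequency of each weather value within each class
counts = pd.crosstab(df["weather"], df["class"])

# divide each column by the number of instances with that class value
cond_prob = counts / counts.sum(axis=0)
print(cond_prob)   # e.g. P(weather=sunny|class=go out) = 2/3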

Make Predictions With a Naive Bayes Model

Given a naive Bayes model, you can make predictions for new data using Bayes theorem.

$MAP(h) = max(P(d|h) * P(h))$

Using our example above, if we had a new instance with the weather of sunny, we can calculate:

$go\ out = P(weather=sunny|class=go\ out) * P(class=go\ out)$
$stay\ home = P(weather=sunny|class=stay\ home) * P(class=stay\ home)$

We can choose the class that has the largest calculated value. We can turn these values into probabilities by normalizing them as follows:

$P(go\ out|weather=sunny) = go\ out / (go\ out + stay\ home)$
$P(stay\ home|weather=sunny) = stay\ home / (go\ out + stay\ home)$

If we had more input variables we could extend the above example. For example, pretend we have a “car” attribute with the values “working” and “broken“. We can multiply this probability into the equation.

For example below is the calculation for the “go out” class label with the addition of the car input variable set to “working”:

$go out = P(weather=sunny|class=go out) * P(car=working|class=go out) * P(class=go out)$
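Putting the pieces together, here is a short sketch that scores and normalizes the two classes; the probability values are assumed for illustration only:

# assumed (illustrative) probabilities learned from training data
p_sunny_given_goout, p_working_given_goout, p_goout = 0.8, 0.7, 0.6
p_sunny_given_stayhome, p_working_given_stayhome, p_stayhome = 0.3, 0.5, 0.4

# un-normalized scores for each class
score_goout = p_sunny_given_goout * p_working_given_goout * p_goout              # 0.336
score_stayhome = p_sunny_given_stayhome * p_working_given_stayhome * p_stayhome  # 0.060

# normalize the scores into probabilities
total = score_goout + score_stayhome
print(score_goout / total, score_stayhome / total)   # ~0.85 vs ~0.15 -> predict "go out"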

Example-1

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:

Convert the given dataset into frequency tables.
Generate a likelihood table by finding the probabilities of the given features.
Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?
        

No    Weather     Play
1     Rainy       Yes
2     Sunny       Yes
3     Overcast    Yes
4     Overcast    Yes
5     Sunny       No
6     Rainy       Yes
7     Sunny       Yes
8     Overcast    Yes
9     Rainy       No
10    Sunny       No
11    Sunny       Yes
12    Rainy       No
13    Overcast    Yes
14    Overcast    Yes

Frequency table for the weather conditions:

Weather     No    Yes
Overcast    0     5
Rainy       2     2
Sunny       2     3
Total       4     10

Likelihood table for the weather conditions:

Weather     No             Yes
Overcast    0              5               5/14 = 0.35
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.35
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

$P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)$
$P(Sunny|Yes)= 3/10= 0.3$
$P(Sunny)= 0.35$
$P(Yes)=0.71$

So $P(Yes|Sunny) = 0.3*0.71/0.35= 0.60$

$P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)$
$P(Sunny|No)= 2/4=0.5$
$P(No)= 0.29$
$P(Sunny)= 0.35$

So $P(No|Sunny)= 0.5*0.29/0.35 = 0.41$

So, as we can see from the above calculation, $P(Yes|Sunny) > P(No|Sunny)$.

Hence, on a sunny day, the player can play the game.
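The same arithmetic can be reproduced in a few lines of Python using the counts from the tables above (only a sketch of the calculation, not a full classifier):

# counts taken from the frequency table above
n_total, n_yes, n_no = 14, 10, 4
sunny_yes, sunny_no, sunny_total = 3, 2, 5

p_yes, p_no = n_yes / n_total, n_no / n_total      # 0.71, 0.29
p_sunny = sunny_total / n_total                    # 0.35
p_sunny_given_yes = sunny_yes / n_yes              # 0.30
p_sunny_given_no = sunny_no / n_no                 # 0.50

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # 0.40 (0.41 with the rounded values)
print(p_yes_given_sunny > p_no_given_sunny)               # True -> the player can play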

Example-2
This example dataset contains examples of the different conditions that are associated with accidents. The target variable, accident, is a binary categorical variable with yes/no values. There are 4 categorical features: weather condition, road condition, traffic condition, and engine problem.

Prior probability computation:

There are 10 data points ($m = 10$): 5 instances of class/label 'yes' ($N_{Accident=yes} = 5$) and 5 instances of class/label 'no' ($N_{Accident=no} = 5$). The prior probabilities can be computed using the equation for the prior probability:

$𝑃(Accident_{yes}) = 5/10$

$P(Accident_{no}) = 5/10$

Class conditional probability computation:

The dataset is split based on the target labels (yes/no) first. Since there are 2 classes for the target variable we get 2 sub-tables. If the target variable had 3 classes we would get 3 sub-tables, one for each of the classes.

The following 2 tables show the dataset for target class/label ‘no’ and ‘yes’ respectively:

The class conditional probabilities can be computed using Table-4 and Table-5.

Predicting posterior probability:

Suppose we are now given a new feature vector:

Weather condition: rain
Road condition: good
Traffic condition: normal
Engine problem: no

The task is to predict whether an accident will happen.

The posterior probability for each of the target classes is computed using the equation for the posterior probability.

Since $P(Accident_{no}|A_{new}) > P(Accident_{yes}|A_{new})$, the prediction is Accident = 'no'.
The probabilities can be obtained by normalizing the posterior probabilities.

Advantages of Naïve Bayes Classifier:

Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for binary as well as multi-class classification.
It performs well in multi-class prediction compared to many other algorithms.
It is one of the most popular choices for text classification problems.

Disadvantages of Naïve Bayes Classifier:

Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.

Applications of Naive Bayes Algorithms


Real-time Prediction: Naive Bayes is an eager learning classifier and it is very fast. Thus, it can be used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability. Here we can predict the probability of multiple classes of the target variable.
Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
Recommendation System: A Naive Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
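A quick sketch of how the three variants are used in scikit-learn; the data here is randomly generated just to show which kind of features each model expects:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)                  # binary target

X_continuous = rng.normal(size=(100, 3))          # continuous features -> GaussianNB
X_counts = rng.integers(0, 10, size=(100, 3))     # word-count features -> MultinomialNB
X_binary = rng.integers(0, 2, size=(100, 3))      # present/absent features -> BernoulliNB

print(GaussianNB().fit(X_continuous, y).score(X_continuous, y))
print(MultinomialNB().fit(X_counts, y).score(X_counts, y))
print(BernoulliNB().fit(X_binary, y).score(X_binary, y))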

Example with Python code
Download the titanic.csv file from Kaggle before running this program.

#import the libraries and load the titanic.csv file
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

df=pd.read_csv('titanic.csv')

# do some data preprocessing: drop unwanted columns and set up the input and target fields
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis=1,inplace=True)
target=df.Survived
inputs=df.drop(['Survived'],axis=1)
print(inputs.head(3))
print(target.head(3))
Pclass     Sex   Age     Fare
0       3    male  22.0   7.2500
1       1  female  38.0  71.2833
2       3  female  26.0   7.9250
0    0
1    1
2    1
# one-hot encode the Sex column (the original column is dropped below)
dummies=pd.get_dummies(inputs.Sex) 
dummies.head(3)
   female  male
0       0     1
1       1     0
2       1     0
inputs=pd.concat([inputs,dummies],axis=1)
inputs.head(3)
inputs.drop(['Sex'],axis=1,inplace=True)
# fill all null values in the Age column with the mean value
inputs.columns[inputs.isna().any()]
inputs.Age=inputs.Age.fillna(inputs.Age.mean())

#do a train test data split 80:20
X_train,X_test,y_train,y_test=train_test_split(inputs,target,test_size=0.2)
print(len(inputs))
print(len(X_train))
print(len(X_test))

# train the Gaussian Naive Bayes model and compute the score
from sklearn.naive_bayes import GaussianNB
model=GaussianNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)
0.7374301675977654
# check the predictions against the actual values for the first 10 records
print(y_test[:10])
210    0
321    0
554    1
732    0
741    0
356    1
44     1
349    0
272    1
341    1
model.predict(X_test[:10])
array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=int64)
model.predict_proba(X_test[:10])
array([[9.99948794e-01, 5.12059773e-05],
       [9.99954949e-01, 4.50513155e-05],
       [1.54513682e-06, 9.99998455e-01],
       [5.52541073e-09, 9.99999994e-01],
       [9.99618589e-01, 3.81410872e-04],
       [1.65503493e-06, 9.99998345e-01],
       [6.05856963e-07, 9.99999394e-01],
       [6.82368619e-07, 9.99999318e-01],
       [1.95886497e-07, 9.99999804e-01],
       [1.64566569e-06, 9.99998354e-01]])
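The classification_report and confusion_matrix imported at the top are not used above; as a possible follow-up, they give a fuller picture of the model than the accuracy score alone:

# evaluate the fitted model on the whole test set
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))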


Refer to the included textbook chapter to learn more.

