Ashish is a technology consultant with 13+ years of experience who specializes in Data Science, the Python ecosystem and Django, DevOps, and automation. He also focuses on the design and delivery of key, impactful programs.
Naive Bayes in Machine Learning [Examples, Models, Types]
In this article, I'll walk you through the fundamentals of Naive Bayes, a robust machine learning algorithm. Known for its simplicity, speed, and effectiveness, especially in real-time scenarios, Naive Bayes leverages Bayes' theorem and assumes feature independence for swift predictions. We'll explore its applications, including spam filtering and sentiment analysis, highlighting its strengths. However, it's essential to acknowledge limitations tied to the assumption of feature independence. Nonetheless, Naive Bayes remains a valuable tool, offering accurate outcomes with minimal training data. Join me as we navigate the key aspects of Naive Bayes in the professional field of machine learning. But before we delve into the concepts of Naive Bayes, here’s a term I believe you should know – Conditional Probability.
Conditional probability, a core concept in probability theory, gauges the likelihood of an event A occurring given that another event B has already occurred. It can be illustrated through examples such as drawing cards from a deck or predicting students' genders in a school. Mathematically expressed as P(A|B) = P(A AND B) / P(B), it quantifies how a probability is adjusted based on existing information. Understanding conditional probability is crucial in various fields, aiding in decision-making, statistical analysis, and machine learning, where it plays a pivotal role in algorithms like Naive Bayes.
Naive Bayes is a simple but surprisingly powerful probabilistic machine learning algorithm used for predictive modeling and classification tasks. Some typical applications of Naive Bayes are spam filtering, sentiment prediction, and classification of documents. It is a popular algorithm mainly because it is easy to implement and its predictions are fast, which in turn improves the scalability of the solution. The Naive Bayes algorithm is traditionally considered the algorithm of choice for practical applications, mostly in cases where instantaneous responses are required for user requests.
It is based on the work of the Rev. Thomas Bayes, hence the name. Before starting off with Naive Bayes, it is important to learn about Bayesian learning, 'Conditional Probability' and 'Bayes Rule'.
Bayesian learning is a supervised learning technique where the goal is to build a model of the distribution of class labels that have a concrete definition of the target attribute. Naïve Bayes is based on applying Bayes' theorem with the naïve assumption of independence between each and every pair of features.
Let us start with the primitives by understanding Conditional Probability with some examples.
Example 1
Consider you have a fair coin and a fair die. When you flip the coin, there is an equal chance of getting either a head or a tail. So you can say that the probability of getting heads or the probability of getting tails is 50%.
Now if you roll the fair die, the probability of getting any one of the 6 numbers would be 1/6 ≈ 0.167. The probability is the same for each of the other numbers on the die.
Example 2
Consider another example of playing cards. You are asked to pick a card from the deck. Can you guess the probability of getting a king given the card is a heart?
The given condition here is that the card is a heart, so the denominator has to be 13 (there are 13 hearts in a deck of cards) and not 52. Since there is only one king among the hearts, the probability that the card is a king given it is a heart is 1/13 ≈ 0.077.
So when you say the conditional probability of A given B, it refers to the probability of the occurrence of A given that B has already occurred. This is a typical example of conditional probability.
Mathematically, the conditional probability of A given B can be defined as P(A AND B) / P(B).
Example 3
Let us see another slightly complicated example to understand conditional probability better.
Consider a school with a total population of 100 people. These 100 people can be classified either as 'Students' or 'Teachers', or as 'Males' or 'Females'.
With the table below of the 100 people tabulated in some form, what will be the conditional probability that a certain person of the school is a ‘Student’ given that she is a ‘Female’?
Role | Female | Male | Total |
---|---|---|---|
Teacher | 10 | 10 | 20 |
Student | 30 | 50 | 80 |
Total | 40 | 60 | 100 |
To compute this, you can filter the sub-population of 40 females and focus only on the 30 female students. So the required probability stands as P(Student | Female) = 30/40 = 0.75 .
P(Student | Female) = [P(Student ∩ Female)] / [P(Female)] = 30/40 = 0.75
This is defined as the intersection(∩) of Student(A) and Female(B) divided by Female(B). Similarly, the conditional probability of B given A can also be calculated using the same mathematical expression.
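As a minimal sketch, the same calculation can be reproduced in Python from the counts in the school table. The dictionary layout and the function name `conditional_probability` are illustrative choices, not part of any library.

```python
from fractions import Fraction

# Counts from the school table: (role, gender) -> number of people.
counts = {
    ("Teacher", "Female"): 10,
    ("Teacher", "Male"): 10,
    ("Student", "Female"): 30,
    ("Student", "Male"): 50,
}


def conditional_probability(counts, role, gender):
    """Return P(role | gender) = P(role AND gender) / P(gender)."""
    total = sum(counts.values())
    p_joint = Fraction(counts[(role, gender)], total)
    gender_count = sum(n for (_, g), n in counts.items() if g == gender)
    p_gender = Fraction(gender_count, total)
    return p_joint / p_gender


p_student_given_female = conditional_probability(counts, "Student", "Female")
print(float(p_student_given_female))  # 0.75
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand calculation of 30/40 without floating-point rounding.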
Bayes' Theorem helps you examine the probability of an event based on the prior knowledge of any event that has correspondence to the former event. Its uses are mainly found in probability theory and statistics. The term naive is used in the sense that the features given to the model are not dependent on each other. In simple terms, if you change the value of one feature in the algorithm, it will not directly influence or change the value of the other features.
For example, the probability that the price of a house is high can be estimated better if we have some prior information, such as the facilities around it, than if the assessment is made without any knowledge of the location of the house.
P(A|B) = [P(B|A)P(A)]/[P(B)]
The equation above shows the basic representation of Bayes' theorem where A and B are two events and:
P(A|B): The conditional probability that event A occurs, given that B has occurred. This is termed as the posterior probability.
P(A) and P(B): The probability of A and B without any correspondence with each other.
P(B|A): The conditional probability of the occurrence of event B, given that A has occurred.
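To make the roles of these terms concrete, here is a small sketch that plugs the numbers from the school table into Bayes' theorem, taking A = Student and B = Female. The helper name `bayes_posterior` is hypothetical and only serves this illustration.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# A = Student, B = Female, using the school table of 100 people.
p_female_given_student = Fraction(30, 80)  # P(B|A): 30 of the 80 students are female
p_student = Fraction(80, 100)              # P(A)
p_female = Fraction(40, 100)               # P(B)

p_student_given_female = bayes_posterior(p_female_given_student, p_student, p_female)
print(p_student_given_female)  # 3/4
```

The result agrees with the value of 0.75 obtained directly from the table, which is a quick sanity check that the theorem is applied correctly.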
Now the question is how you can use naive Bayes in machine learning. To understand it clearly, let us take an example.
Consider a simple problem where you need to learn a machine learning model from a given set of attributes. Then you will have to describe a hypothesis or a relation to a response variable and then using this relation, you will have to predict a response, given the set of attributes you have.
You can create a learner using Bayes' Theorem that can predict the probability of the response variable that will belong to the same class, given a new set of attributes.
Consider the previous question again and then assume that A is the response variable and B is the given attribute. So according to the equation of Bayes' Theorem, we have:
P(A|B): The conditional probability of the response variable that belongs to a particular value, given the input attributes, also known as the posterior probability.
P(A): The prior probability of the response variable.
P(B): The probability of training data(input attributes) or the evidence.
P(B|A): This is termed as the likelihood of the training data.
The Bayes' Theorem can be reformulated in correspondence with the machine learning algorithm as:
posterior = (prior x likelihood) / (evidence)
Let’s look into another problem. Consider a situation where the number of attributes is n, and the response is a Boolean value. i.e. Either True or False. The attributes are categorical (2 categories in this case). You need to train the classifier for all the values in the instance and the response space.
Learning this model directly is impractical for most machine learning algorithms, since you need to estimate 2∗(2^n-1) parameters. For 30 Boolean attributes, that means learning more than 2 billion parameters, which is unrealistic.
A classifier is a machine learning model which is used to classify different objects based on certain behavior. Naive Bayes classifiers in machine learning are a family of simple probabilistic machine learning models that are based on Bayes' Theorem. In simple words, it is a classification technique with an assumption of independence among predictors.
The Naive Bayes classifier reduces the complexity of the Bayesian classifier by making an assumption of conditional independence over the training dataset.
Consider you are given variables X, Y, and Z. X will be conditionally independent of Y given Z if and only if the probability distribution of X is independent of the value of Y given Z. This is the assumption of conditional independence.
In other words, you can also say that X and Y are conditionally independent given Z if and only if, the knowledge of the occurrence of X provides no information on the likelihood of the occurrence of Y and vice versa, given that Z occurs. This assumption is the reason behind the term naive in Naive Bayes.
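The reduction in complexity can be sketched with a quick count. The first function uses the 2∗(2^n-1) figure quoted earlier for n Boolean attributes and a Boolean response. The second uses 2∗n, which is the usual count under conditional independence (one probability per attribute and class); that figure is an assumption added here and is not stated in the text.

```python
def full_bayes_parameters(n):
    """Parameters needed without the independence assumption: 2 * (2**n - 1)."""
    return 2 * (2 ** n - 1)


def naive_bayes_parameters(n):
    """Parameters needed with conditional independence (assumed count): 2 * n."""
    return 2 * n


print(full_bayes_parameters(30))   # 2147483646
print(naive_bayes_parameters(30))  # 60
```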
The likelihood can be written considering n different attributes as:
$$P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$
In the mathematical expression, X represents the attributes, Y represents the response variable. So, P(X|Y) becomes equal to the product of the probability distribution of each attribute given Y.
Maximizing a Posteriori
If you want to find the posterior probability of P(Y|X) for multiple values of Y, you need to calculate the expression for all the different values of Y.
Let us assume a new instance variable X_NEW. You need to calculate the probability that Y will take any value given the observed attributes of X_NEW and given the distributions P(Y) and P(X|Y) which are estimated from the training dataset.
To predict the response variable, you compare the values obtained for P(Y|X) across the different values of Y and pick the value of Y with the maximum probability. Hence, this method is known as maximizing a posteriori.
Maximizing Likelihood
You can simplify the Naive Bayes algorithm if you assume that the response variable is uniformly distributed which means that it is equally likely to get any response. The advantage of this assumption is that the a priori or the P(Y) becomes a constant value.
Since the a priori and the evidence are then independent of the response variable, they can be removed from the equation. So, maximizing the posteriori reduces to maximizing the likelihood.
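The difference between the two rules can be sketched in a few lines. The class names and all the numbers below are hypothetical values chosen only for illustration, and `map_class` and `ml_class` are illustrative helper names.

```python
def map_class(priors, likelihoods):
    """Maximum a posteriori: pick the class y that maximizes P(y) * P(x|y)."""
    return max(priors, key=lambda y: priors[y] * likelihoods[y])


def ml_class(likelihoods):
    """Maximum likelihood: pick the class y that maximizes P(x|y) alone."""
    return max(likelihoods, key=likelihoods.get)


# Hypothetical likelihoods P(x|y) of one observation under two classes.
likelihoods = {"spam": 0.02, "ham": 0.03}

skewed_priors = {"spam": 0.8, "ham": 0.2}    # classes are not equally likely
uniform_priors = {"spam": 0.5, "ham": 0.5}   # the uniform assumption

print(map_class(skewed_priors, likelihoods))   # spam
print(map_class(uniform_priors, likelihoods))  # ham
print(ml_class(likelihoods))                   # ham
```

With a uniform prior, the prior is the same constant for every class, so the MAP prediction coincides with the maximum likelihood prediction. With a skewed prior, the two can disagree.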
Consider a situation where you have 1000 fruits which are either ‘banana’ or ‘apple’ or ‘other’. These will be the possible classes of the variable Y.
The data has the following X variables, all of which are binary (0 or 1): Long (x1), Sweet (x2) and Yellow (x3).
The training dataset will look like this:
Fruit | Long (x1) | Sweet (x2) | Yellow (x3) |
---|---|---|---|
Apple | 0 | 0 | 1 |
Banana | 1 | 0 | 1 |
Apple | 0 | 1 | 0 |
Other | 1 | 1 | 1 |
.. | .. | .. | .. |
Now let us sum up the training dataset to form a count table as below:
Type | Long | Not Long | Sweet | Not sweet | Yellow | Not Yellow | Total |
---|---|---|---|---|---|---|---|
Banana | 400 | 100 | 350 | 150 | 450 | 50 | 500 |
Apple | 0 | 300 | 150 | 150 | 300 | 0 | 300 |
Other | 100 | 100 | 150 | 50 | 50 | 150 | 200 |
Total | 500 | 500 | 650 | 350 | 800 | 200 | 1000 |
The goal of the classifier is to predict whether a given fruit is a 'Banana', an 'Apple' or 'Other' when the three attributes (long, sweet and yellow) are known.
Consider a case where you are given that a fruit is long, sweet and yellow and you need to predict what type of fruit it is. This is the same as predicting Y for a new observation when only its X attributes are known. You can easily solve this problem by using Naive Bayes.
You need to compute three probabilities, i.e. the probability of the fruit being a banana, an apple or other. The one with the highest probability will be your answer.
Step 1:
First of all, you need to compute the proportion of each fruit class out of all the fruits from the population which is the prior probability of each fruit class.
The Prior probability can be calculated from the training dataset:
P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Apple) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20
The training dataset contains 1000 records. Out of which, you have 500 bananas, 300 apples and 200 others. So the priors are 0.5, 0.3 and 0.2 respectively.
Step 2:
Secondly, you need to calculate the probability of the evidence that goes into the denominator. Here it is taken as the product of the individual probabilities P(x) of each observed attribute:
P(x1=Long) = 500 / 1000 = 0.50
P(x2=Sweet) = 650 / 1000 = 0.65
P(x3=Yellow) = 800 / 1000 = 0.80
Step 3:
The third step is to compute the likelihood of the evidence, which is the product of the conditional probabilities of the 3 attributes given the class.
The likelihood for Banana:
P(x1=Long | Y=Banana) = 400 / 500 = 0.80
P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70
P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90
Therefore, the overall likelihood for banana is the product of the above three, i.e. 0.8 * 0.7 * 0.9 = 0.504.
Step 4:
The last step is to substitute all the 3 equations into the mathematical expression of Naive Bayes to get the probability.
P(Banana|Long,Sweet and Yellow) = [P(Long|Banana)∗P(Sweet|Banana)∗P(Yellow|Banana)∗P(Banana)] / [P(Long)∗P(Sweet)∗P(Yellow)]
= (0.8∗0.7∗0.9∗0.5)/P(Evidence) = 0.252/P(Evidence)
P(Apple|Long,Sweet and Yellow) = 0, because P(Long|Apple) = 0
P(Other|Long,Sweet and Yellow) = 0.01875/P(Evidence)
In a similar way, you can also compute the probabilities for ‘Apple’ and ‘Other’. The denominator is the same for all cases.
Banana gets the highest probability, so it is taken as the predicted class.
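The four steps can be sketched end to end in Python using the count table. Because the evidence in the denominator is the same for every class, the sketch compares only the numerators (prior times likelihood). The dictionary layout and the function name `unnormalized_posteriors` are illustrative choices.

```python
from fractions import Fraction

# Count table from the fruit example: per class, how many fruits have each attribute.
counts = {
    "Banana": {"Long": 400, "Sweet": 350, "Yellow": 450, "Total": 500},
    "Apple": {"Long": 0, "Sweet": 150, "Yellow": 300, "Total": 300},
    "Other": {"Long": 100, "Sweet": 150, "Yellow": 50, "Total": 200},
}


def unnormalized_posteriors(counts, features):
    """Return prior * likelihood for each class, given the observed features."""
    grand_total = sum(row["Total"] for row in counts.values())
    scores = {}
    for fruit, row in counts.items():
        prior = Fraction(row["Total"], grand_total)          # Step 1: prior
        likelihood = Fraction(1)
        for feature in features:                             # Step 3: likelihood
            likelihood *= Fraction(row[feature], row["Total"])
        scores[fruit] = prior * likelihood                   # Step 4: numerator
    return scores


scores = unnormalized_posteriors(counts, ["Long", "Sweet", "Yellow"])
prediction = max(scores, key=scores.get)

for fruit, score in scores.items():
    print(fruit, float(score))
print("Predicted class:", prediction)
```

The scores reproduce the numerators from the worked example: 0.252 for Banana, 0 for Apple and 0.01875 for Other, so the predicted class is Banana.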
The main types of Naive Bayes classifier are mentioned below:
In Gaussian Naive Bayes, the likelihood of the features is assumed to be Gaussian, so the conditional probability takes the following form:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
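As a rough sketch, the Gaussian likelihood can be written as a small Python function. The function name `gaussian_likelihood` and the example values of the mean and variance are hypothetical; in practice the mean and variance would be estimated per class and per feature from the training data.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Return the Gaussian density of x for a class with the given mean and variance."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)


# Hypothetical class statistics for one feature: mean 5.0, variance 1.0.
at_mean = gaussian_likelihood(5.0, mean=5.0, variance=1.0)
away_from_mean = gaussian_likelihood(7.0, mean=5.0, variance=1.0)
print(at_mean, away_from_mean)
```

The density is highest at the class mean and falls off symmetrically on either side, so feature values far from the mean of a class contribute a small likelihood for that class.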
The naive Bayes algorithm has both its pros and its cons.
Pros of Naive Bayes —
Cons of Naive Bayes —
When you have a model with a lot of attributes, the entire probability can become zero because the estimated probability of just one feature value is zero, as happened with P(Long|Apple) in the fruit example. To overcome this, you can add a small value (usually 1) to the count in the numerator so that the overall probability does not come out as zero.
This type of correction is called the Laplace Correction. Most Naive Bayes implementations expose it as a parameter.
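A minimal sketch of the correction, applied to the zero count of long apples from the fruit example, is shown below. The function name and the parameters `alpha` and `n_values` are illustrative. Adding `alpha * n_values` to the denominator is the usual convention so that the smoothed probabilities still sum to one; the text above only mentions the numerator, so treat that part as an assumption.

```python
from fractions import Fraction


def smoothed_probability(count, class_total, n_values=2, alpha=1):
    """Return (count + alpha) / (class_total + alpha * n_values).

    n_values is the number of distinct values the feature can take
    (2 for a binary feature such as Long / Not Long).
    """
    return Fraction(count + alpha, class_total + alpha * n_values)


# Fruit example: 0 of the 300 apples are long, 300 are not long.
p_long_given_apple = smoothed_probability(0, 300)
p_not_long_given_apple = smoothed_probability(300, 300)

print(p_long_given_apple)      # 1/302
print(p_not_long_given_apple)  # 301/302
```

After the correction, P(Long|Apple) is small but no longer zero, so a single unseen feature value cannot wipe out the whole product.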
There are a lot of real-life applications of the Naive Bayes classifier, some of which are mentioned below:
In Python, the Naive Bayes classifier is implemented in the scikit-learn library. Let us look into an example by importing the standard iris dataset to predict the Species of flowers:
```python
# Import packages
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Import data
training = pd.read_csv('/content/iris_training.csv')
test = pd.read_csv('/content/iris_test.csv')

# Create the X, Y, Training and Test
X_Train = training.drop('Species', axis=1)
Y_Train = training.loc[:, 'Species']
X_Test = test.drop('Species', axis=1)
Y_Test = test.loc[:, 'Species']

# Init the Gaussian Classifier
model = GaussianNB()

# Train the model
model.fit(X_Train, Y_Train)

# Predict Output
pred = model.predict(X_Test)

# Plot Confusion Matrix (rows: predicted species, columns: true species)
mat = confusion_matrix(pred, Y_Test)
names = np.unique(pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=names, yticklabels=names)
plt.xlabel('Truth')
plt.ylabel('Predicted')
```
The output is a confusion-matrix heatmap, with the true species on the x-axis ('Truth') and the predicted species on the y-axis ('Predicted').
You can improve the power of a Naive Bayes model by following these tips:
Let's review what we've covered:
I've talked about Naive Bayes and its types, discussed the pros and cons, and explored where it's applied, like sentiment analysis and spam filtering. I explained how a Naive Bayes model predicts outcomes and walked through creating and improving one. Naive Bayes is handy in real-world tasks due to its speed and simplicity, but there's a catch: it works best when predictors are independent. In real-life situations, though, predictors often depend on each other, which can impact the classifier's performance. Nonetheless, Naive Bayes remains a popular choice for quick and straightforward solutions in machine learning.
High scalability
Gives good accuracy with a small amount of data
Less training time.
Provides partial_fit mechanism while training the model with a large amount of data.
Considering each feature as an independent entity gives more accuracy.
It assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features. Its parameters are typically estimated with the method of maximum likelihood.
Despite their simple design and apparently oversimplified assumptions, Naive Bayes classifiers have worked quite well in many complex real-world situations. An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification.
It is very fast and scalable. So many real-time prediction applications use it.
Naive Bayes uses Bayes' theorem as the basis of the algorithm. Bayes' theorem itself was not designed for classification; it only relates conditional probabilities to each other. Naive Bayes combines the theorem with the naive assumption of feature independence to build a classifier.
By fitting the dataset into Bayes' theorem, the Naive Bayes algorithm picks the class with the highest probability, which is why argmax is used. Following the equation, we get:
$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(X_i \mid y)$$
As we know, the Naive Bayes algorithm has different types based on assumptions of the distribution of P( Xi / y). For more details, check out What is a Naive Bayes Classifier?
It is basically a classification algorithm. It is trained to identify categories and to predict which category new values fall into. However, it depends strongly on its assumptions, and with some changes to them it can also be used for regression. For more details, please refer to the (PDF) Naive Bayes for Regression (researchgate.net).