
Beginner's guide to the Naïve Bayes algorithm (Part 2)

Posted at 2021-04-12

Continuing from the last article...

What is Naïve Bayes?

It is called Naïve because of its naïve core assumption: it pretends that the appearance of a particular feature in a class is
unrelated to the appearance of any other feature. Let us take an example. A fruit can be judged to be an orange if its colour is orange, its shape is round, and its diameter is about 2.5 inches. All of those characteristics contribute independently to the probability, even though in reality they depend on each other.

Naïve Bayes is practical when there are lots of data points; otherwise, the model can be biased. Even though the model is simple, it can outclass some advanced models on the same exercise.
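
Formally, for features x1, ..., xn, this independence assumption lets the posterior factorize into a product of per-feature likelihoods:

P(class | x1, ..., xn) ∝ P(class) * P(x1 | class) * ... * P(xn | class)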

(Figure: bayes.jpg)

Naïve Bayes algorithm

I think the most straightforward way to experience how the algorithm works is to use an example. Cricket is a popular game in some countries. I will use a data set that contains the pitch condition and the target variable batting first (many believe that if the pitch is grassy, it is good to bowl first). Now we need to classify whether the team chooses to bat first based on the pitch condition. (The team needs to win the coin toss first.)
Steps:

  1. Making a frequency table of the classes.
  2. Creating a likelihood table of the probability of each class given a feature value.
  3. Calculating the posterior probability for each class using the Naïve Bayesian equation (shown below).
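
The Naïve Bayesian equation in step 3 is Bayes' theorem applied to a class and a feature value:

P(class | feature) = P(feature | class) * P(class) / P(feature)

Here P(class | feature) is the posterior probability, P(feature | class) is the likelihood, P(class) is the prior, and P(feature) is the evidence.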

Data Table

| Pitch Condition | Batting First |
| --- | --- |
| Dead | Yes |
| Grassy | No |
| Dusty | Yes |
| Dead | Yes |
| Grassy | No |
| Grassy | Yes |
| Dead | Yes |
| Dusty | Yes |
| Grassy | No |
| Dusty | Yes |
| Dusty | No |
| Dead | Yes |
| Grassy | No |
| Dead | No |
| Dead | No |

Frequency Table

| Pitch type | No | Yes |
| --- | --- | --- |
| Dead | 2 | 4 |
| Grassy | 4 | 1 |
| Dusty | 1 | 3 |
| Total | 7 | 8 |

Likelihood Table

| Pitch type | No | Yes | Likelihood |
| --- | --- | --- | --- |
| Dead | 2 | 4 | 6/15 = 0.40 |
| Grassy | 4 | 1 | 5/15 = 0.33 |
| Dusty | 1 | 3 | 4/15 = 0.27 |
| Total | 7 (7/15 = 0.47) | 8 (8/15 = 0.53) | |
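
The frequency table and the likelihoods above can be reproduced in a few lines with pandas. A minimal sketch, assuming pandas is installed (the column names pitch and batting_first are my own):

import pandas as pd

pitch = ["Dead", "Grassy", "Dusty", "Dead", "Grassy", "Grassy", "Dead",
         "Dusty", "Grassy", "Dusty", "Dusty", "Dead", "Grassy", "Dead", "Dead"]
batting_first = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes",
                 "No", "Yes", "No", "Yes", "No", "No", "No"]
data = pd.DataFrame({"pitch": pitch, "batting_first": batting_first})

# frequency table: counts of Yes/No per pitch type
freq = pd.crosstab(data["pitch"], data["batting_first"])
print(freq)

# likelihood of each pitch type: row totals divided by the number of matches
print(freq.sum(axis=1) / len(data))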

Now the question:
A team will pick batting first if the pitch is dusty. Is this statement correct or not?
To answer it, let us calculate the posterior probability.

P(Yes | Dusty) = P(Dusty | Yes) * P(Yes) / P(Dusty) = (3/8 * 8/15) / (4/15) = 0.75

So, there is a 75% probability of batting first if the pitch is dusty.
Naïve Bayes uses a similar process to predict the probability of each class based on several properties. This algorithm is primarily used in text categorization and in problems with multiple classes.
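
This hand calculation can be checked with scikit-learn's CategoricalNB. A minimal sketch, assuming the integer encoding Dead=0, Grassy=1, Dusty=2 for the pitch and No=0, Yes=1 for the target (my choices, not from the article); alpha is set close to zero to switch off smoothing so the result matches the unsmoothed calculation above:

from sklearn.naive_bayes import CategoricalNB
import numpy as np

# the 15 rows of the data table, encoded as Dead=0, Grassy=1, Dusty=2
X = np.array([[0], [1], [2], [0], [1], [1], [0], [2], [1], [2], [2], [0], [1], [0], [0]])
# batting first, encoded as No=0, Yes=1
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

# alpha near zero effectively disables Laplace smoothing
model = CategoricalNB(alpha=1e-10)
model.fit(X, y)

# posterior [P(No), P(Yes)] for a Dusty pitch
print(model.predict_proba([[2]]))
# approximately [[0.25 0.75]]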

Naïve Bayes in Python.

The most popular library that includes the Naïve Bayes algorithm is scikit-learn. There are three commonly used Naïve Bayes models in the scikit-learn library.

  1. Gaussian: for continuous features that are assumed to follow a normal distribution.
  2. Multinomial: for discrete counts, such as word frequencies in text.
  3. Bernoulli: for binary features.

We have to choose the model based on the dataset. I am not going to explain them one by one here.

Example Code:

nb.py
from sklearn.naive_bayes import GaussianNB
import numpy as np

# feature matrix and target variable
X = np.array([[8, 2], [3, 6], [5, 1], [2, 0], [4, 3], [-3, 0],
              [-2, 1], [3, 1], [-2, 4], [5, 7], [-1, 1]])
y = np.array([1, 2, 3, 3, 2, 1, 1, 2, 3, 3, 2])

# fit a Gaussian Naive Bayes model and classify three new points
model = GaussianNB()
model.fit(X, y)
pred = model.predict([[5, 2], [1, 4], [-1, 3]])

print(pred)
# [2 2 3]
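
If class probabilities are needed instead of hard labels, GaussianNB also provides predict_proba; a short usage sketch continuing from the code above:

# per-class probabilities for the first test point,
# columns ordered as in model.classes_
print(model.classes_)
print(model.predict_proba([[5, 2]]))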

Things to consider:

  • Remove correlated features.
  • Be careful when selecting features and pay attention to data preprocessing.
  • Apply smoothing (see the sketch after this list).
  • Convert continuous features to a normal distribution.
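
As an example of smoothing, scikit-learn's Naïve Bayes classes expose Laplace (add-one) smoothing through the alpha parameter. A minimal sketch with MultinomialNB on hypothetical word-count data (the numbers are illustrative only):

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# toy word-count features: each row is a document, each column a word
X = np.array([[2, 1, 0], [0, 3, 1], [1, 0, 4]])
y = np.array([0, 1, 1])

# alpha=1.0 applies Laplace smoothing, so a word never seen with a class
# in training still gets a small non-zero probability
model = MultinomialNB(alpha=1.0)
model.fit(X, y)
print(model.predict(np.array([[1, 1, 1]])))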

Let us look at the advantages and disadvantages of the Naïve Bayes algorithm.

Advantages:

  • Easy and fast.
  • Works well on multiclass problems.
  • Good for categorical variables.

Disadvantages:

  • Works best only when there are many data points.
  • The training data set must contain every category of each categorical variable; otherwise, we have to use smoothing methods.
  • Difficult to use in real-life situations, since the algorithm assumes the predictors are independent.

*This article is written by @nuwan, a member of @qualitia_cdev.
