Welcome to Day 8 of the 30 Days of Data Science Series! Today, we’re diving into the Naive Bayes Algorithm, a probabilistic machine learning model widely used for classification tasks, especially in text classification. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of Naive Bayes in Python.
1. What is Naive Bayes?
Naive Bayes is a family of probabilistic algorithms based on Bayes’ Theorem with the “naive” assumption that all features are independent of each other. Despite this simplifying assumption, Naive Bayes performs remarkably well in many real-world applications, particularly in text classification (e.g., spam detection, sentiment analysis).
Key Concepts:
Bayes’ Theorem:
P(y∣X) = P(X∣y) ⋅ P(y) / P(X)
P(y∣X): Posterior probability of class y given features X.
P(X∣y): Likelihood of features X given class y.
P(y): Prior probability of class y.
P(X): Probability of features X.
Naive Assumption: Features are conditionally independent given the class label:
P(X∣y) = P(x₁∣y) ⋅ P(x₂∣y) ⋅ ⋯ ⋅ P(xₙ∣y)
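To make the factorization concrete, here is a tiny sketch with made-up probabilities (the words, priors, and likelihoods are purely illustrative) that scores two classes for an email containing the words "free" and "offer":

# Toy Naive Bayes calculation (all numbers are made up for illustration)
priors = {'spam': 0.4, 'not_spam': 0.6}        # P(y)
likelihoods = {                                # P(x_i | y)
    'spam':     {'free': 0.30, 'offer': 0.20},
    'not_spam': {'free': 0.05, 'offer': 0.04},
}
words = ['free', 'offer']

# Unnormalized posterior: P(y) * P(x1|y) * P(x2|y)
scores = {y: priors[y] for y in priors}
for y in scores:
    for w in words:
        scores[y] *= likelihoods[y][w]

# Divide by P(X), the sum over both classes, to normalize
total = sum(scores.values())
for y, s in scores.items():
    print(y, round(s / total, 3))
# spam: 0.4*0.30*0.20 = 0.0240; not_spam: 0.6*0.05*0.04 = 0.0012
# -> P(spam | X) ≈ 0.952, so the email is classified as spam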
Types of Naive Bayes:
Gaussian Naive Bayes: Assumes features follow a normal distribution.
Multinomial Naive Bayes: Used for discrete data (e.g., word counts in text).
Bernoulli Naive Bayes: Used for binary/boolean features.
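In scikit-learn, each variant is its own class; here is a minimal sketch of which one fits which kind of feature:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb = GaussianNB()      # continuous features (e.g., measurements)
mnb = MultinomialNB()   # count features (e.g., word frequencies)
bnb = BernoulliNB()     # binary features (e.g., word present/absent)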
2. When to Use Naive Bayes?
Text classification tasks (e.g., spam detection, sentiment analysis).
High-dimensional datasets (e.g., text data with many words).
When interpretability and simplicity are important.
3. Implementation in Python
Let’s implement Naive Bayes for a spam classification problem using Python.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Prepare the Data
We’ll use a simple dataset with features like word frequencies to classify emails as spam or not spam.
# Example data
data = {
    'Feature1': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # Word frequency for word 1
    'Feature2': [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],  # Word frequency for word 2
    'Feature3': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # Word frequency for word 3
    'Spam':     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # Target variable (0 = Not Spam, 1 = Spam)
}
df = pd.DataFrame(data)
Step 3: Split Data into Features and Target
X = df[['Feature1', 'Feature2', 'Feature3']]  # Features
y = df['Spam']                                # Target
Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 5: Train the Naive Bayes Model
We’ll use Multinomial Naive Bayes for this example.
model = MultinomialNB()
model.fit(X_train, y_train)
Step 6: Make Predictions
y_pred = model.predict(X_test)
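Because Naive Bayes is a probabilistic model, you can also look at the posterior probability of each class instead of only the hard labels. scikit-learn exposes this via predict_proba:

# Posterior probability per class for each test sample;
# column order follows model.classes_ (here: [0, 1])
probabilities = model.predict_proba(X_test)
print(probabilities)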
Step 7: Evaluate the Model
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[1 0]
 [0 1]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
4. Key Evaluation Metrics
Accuracy: Percentage of correct predictions.
Confusion Matrix:
True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).
Classification Report:
Precision: Ratio of correctly predicted positive observations to total predicted positives.
Recall: Ratio of correctly predicted positive observations to all actual positives.
F1-Score: Harmonic mean of precision and recall (a worked sketch follows this list).
Support: Number of actual occurrences of each class.
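To see how these numbers fall out of the confusion matrix, here is a short sketch that recomputes precision, recall, and F1 for the spam class from the TP/FP/FN counts above:

# For a binary confusion matrix (rows = actual, columns = predicted),
# ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)   # of everything predicted spam, how much was spam
recall = tp / (tp + fn)      # of all actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)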
5. Key Takeaways
Naive Bayes is simple, fast, and effective for text classification.
It assumes feature independence, which may not hold true in all cases.
It’s widely used in spam detection, sentiment analysis, and recommendation systems.
6. Practice Exercise
Experiment with different types of Naive Bayes (Gaussian, Multinomial, Bernoulli) and compare their performance (a starter sketch follows this list).
Apply Naive Bayes to a real-world dataset (e.g., SMS Spam Collection dataset) and evaluate the results.
Explore how feature scaling affects Gaussian Naive Bayes.
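As a starting point for the first exercise, here is a minimal sketch that compares all three variants on the toy data from this lesson (it reuses X_train, X_test, y_train, and y_test from above; on a real dataset you would want proper cross-validation):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Train and score each variant on the same split
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model()
    clf.fit(X_train, y_train)
    print(Model.__name__, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))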
That’s it for Day 8! Tomorrow, we’ll explore Principal Component Analysis (PCA), a dimensionality reduction technique. Keep practicing, and feel free to ask questions in the comments! 🚀