Welcome to Day 8 of the 30 Days of Data Science Series! Today, we’re diving into the Naive Bayes Algorithm, a probabilistic machine learning model widely used for classification tasks, especially in text classification. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of Naive Bayes in Python.
1. What is Naive Bayes?
Naive Bayes is a family of probabilistic algorithms based on Bayes’ Theorem with the “naive” assumption that all features are independent of each other. Despite this simplifying assumption, Naive Bayes performs remarkably well in many real-world applications, particularly in text classification (e.g., spam detection, sentiment analysis).
Key Concepts:
Bayes’ Theorem:
P(y∣X) = P(X∣y) ⋅ P(y) / P(X)
P(y∣X): Posterior probability of class y given features X.
P(X∣y): Likelihood of features X given class y.
P(y): Prior probability of class y.
P(X): Probability of features X.
Naive Assumption: Features are conditionally independent given the class label:
P(X∣y) = P(x₁∣y) ⋅ P(x₂∣y) ⋅ ⋯ ⋅ P(xₙ∣y)
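To make the factorization concrete, here is a tiny sketch with made-up probabilities (the words, priors, and likelihoods are purely illustrative) that scores two classes for an email containing the words "free" and "offer":

# Toy Naive Bayes calculation (all numbers are made up for illustration)
priors = {'spam': 0.4, 'not_spam': 0.6}        # P(y)
likelihoods = {                                # P(x_i | y)
    'spam':     {'free': 0.30, 'offer': 0.20},
    'not_spam': {'free': 0.05, 'offer': 0.04},
}
words = ['free', 'offer']

# Unnormalized posterior: P(y) * P(x1|y) * P(x2|y)
scores = {y: priors[y] for y in priors}
for y in scores:
    for w in words:
        scores[y] *= likelihoods[y][w]

# Divide by P(X), the sum over both classes, to normalize
total = sum(scores.values())
for y, s in scores.items():
    print(y, round(s / total, 3))
# spam: 0.4*0.30*0.20 = 0.0240; not_spam: 0.6*0.05*0.04 = 0.0012
# -> P(spam | X) ≈ 0.952, so the email is classified as spam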
Types of Naive Bayes:
Gaussian Naive Bayes: Assumes features follow a normal distribution.
Multinomial Naive Bayes: Used for discrete data (e.g., word counts in text).
Bernoulli Naive Bayes: Used for binary/boolean features.
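In scikit-learn, each variant is its own class; here is a minimal sketch of which one fits which kind of feature:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb = GaussianNB()      # continuous features (e.g., measurements)
mnb = MultinomialNB()   # count features (e.g., word frequencies)
bnb = BernoulliNB()     # binary features (e.g., word present/absent)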
2. When to Use Naive Bayes?
Text classification tasks (e.g., spam detection, sentiment analysis).
High-dimensional datasets (e.g., text data with many words).
When interpretability and simplicity are important.
3. Implementation in Python
Let’s implement Naive Bayes for a spam classification problem using Python.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Prepare the Data
We’ll use a simple dataset with features like word frequencies to classify emails as spam or not spam.
# Example data
data = {
    'Feature1': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # Word frequency for word 1
    'Feature2': [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],  # Word frequency for word 2
    'Feature3': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # Word frequency for word 3
    'Spam':     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # Target variable (0 = Not Spam, 1 = Spam)
}
df = pd.DataFrame(data)
Step 3: Split Data into Features and Target
X = df[['Feature1', 'Feature2', 'Feature3']]  # Features
y = df['Spam']                                # Target
Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 5: Train the Naive Bayes Model
We’ll use Multinomial Naive Bayes for this example.
model = MultinomialNB()
model.fit(X_train, y_train)
Step 6: Make Predictions
y_pred = model.predict(X_test)
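Because Naive Bayes is a probabilistic model, you can also look at the posterior probability of each class instead of only the hard labels. scikit-learn exposes this via predict_proba:

# Posterior probability per class for each test sample;
# column order follows model.classes_ (here: [0, 1])
probabilities = model.predict_proba(X_test)
print(probabilities)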
Step 7: Evaluate the Model
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 1.0
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[1 0]
 [0 1]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
4. Key Evaluation Metrics
Accuracy: Percentage of correct predictions.
Confusion Matrix:
True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).
Classification Report:
Precision: Ratio of correctly predicted positive observations to total predicted positives.
Recall: Ratio of correctly predicted positive observations to all actual positives.
F1-Score: Harmonic mean of precision and recall (a worked sketch follows this list).
Support: Number of actual occurrences of each class.
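To see how these numbers fall out of the confusion matrix, here is a short sketch that recomputes precision, recall, and F1 for the spam class from the TP/FP/FN counts above:

# For a binary confusion matrix (rows = actual, columns = predicted),
# ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)   # of everything predicted spam, how much was spam
recall = tp / (tp + fn)      # of all actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)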
5. Key Takeaways
Naive Bayes is simple, fast, and effective for text classification.
It assumes feature independence, which may not hold true in all cases.
It’s widely used in spam detection, sentiment analysis, and recommendation systems.
6. Practice Exercise
Experiment with different types of Naive Bayes (Gaussian, Multinomial, Bernoulli) and compare their performance (a starter sketch follows this list).
Apply Naive Bayes to a real-world dataset (e.g., SMS Spam Collection dataset) and evaluate the results.
Explore how feature scaling affects Gaussian Naive Bayes.
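As a starting point for the first exercise, here is a minimal sketch that compares all three variants on the toy data from this lesson (it reuses X_train, X_test, y_train, and y_test from above; on a real dataset you would want proper cross-validation):

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Train and score each variant on the same split
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model()
    clf.fit(X_train, y_train)
    print(Model.__name__, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))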
That’s it for Day 8! Tomorrow, we’ll explore Principal Component Analysis (PCA), a dimensionality reduction technique. Keep practicing, and feel free to ask questions in the comments! 🚀