
    Welcome to Day 8 of the 30 Days of Data Science Series! Today, we’re diving into the Naive Bayes Algorithm, a probabilistic machine learning model widely used for classification tasks, especially in text classification. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of Naive Bayes in Python.


    1. What is Naive Bayes?

    Naive Bayes is a family of probabilistic algorithms based on Bayes’ Theorem with the “naive” assumption that all features are independent of each other. Despite this simplifying assumption, Naive Bayes performs remarkably well in many real-world applications, particularly in text classification (e.g., spam detection, sentiment analysis).

    Key Concepts:

    1. Bayes’ Theorem:

      P(y∣X) = [ P(X∣y) ⋅ P(y) ] / P(X)

      • P(y∣X): Posterior probability of class y given features X.

      • P(X∣y): Likelihood of features X given class y.

      • P(y): Prior probability of class y.

      • P(X): Probability of features X.

    2. Naive Assumption: Features are conditionally independent given the class label (a minimal numeric sketch follows this list):

      P(X∣y)=P(x1∣y)⋅P(x2∣y)⋅⋯⋅P(xn∣y)

    3. Types of Naive Bayes:

      • Gaussian Naive Bayes: Assumes features follow a normal distribution.

      • Multinomial Naive Bayes: Used for discrete data (e.g., word counts in text).

      • Bernoulli Naive Bayes: Used for binary/boolean features.
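
    To make Bayes' theorem and the naive assumption concrete, here is a minimal numeric sketch. The priors and likelihoods below are made up for illustration, not taken from any dataset:

    python
    # Hypothetical prior: 30% of all emails are spam
    p_spam, p_not_spam = 0.3, 0.7

    # Hypothetical likelihoods P(word_i present | class) for two words
    p_words_given_spam = [0.8, 0.6]
    p_words_given_not_spam = [0.1, 0.3]

    # Naive assumption: multiply the per-word likelihoods
    joint_spam = p_spam * p_words_given_spam[0] * p_words_given_spam[1]                    # 0.144
    joint_not_spam = p_not_spam * p_words_given_not_spam[0] * p_words_given_not_spam[1]    # 0.021

    # P(X) is the normalizer: the sum of the joint probabilities over both classes
    posterior_spam = joint_spam / (joint_spam + joint_not_spam)
    print(posterior_spam)  # ≈ 0.873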


    2. When to Use Naive Bayes?

    • Text classification tasks (e.g., spam detection, sentiment analysis).

    • High-dimensional datasets (e.g., text data with many words).

    • When interpretability and simplicity are important.


    3. Implementation in Python

    Let’s implement Naive Bayes for a spam classification problem using Python.

    Step 1: Import Libraries

    python
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    Step 2: Prepare the Data

    We’ll use a simple dataset with features like word frequencies to classify emails as spam or not spam.

    python
    # Example data
    data = {
        'Feature1': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # Word frequency for word 1
        'Feature2': [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],  # Word frequency for word 2
        'Feature3': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # Word frequency for word 3
        'Spam': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]      # Target variable (0 = Not Spam, 1 = Spam)
    }
    df = pd.DataFrame(data)
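
    In a real project, word-frequency features like these would come from raw text rather than being typed in by hand. A quick sketch (not part of the original lesson) using scikit-learn's CountVectorizer:

    python
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["win money now", "meeting at noon"]  # two hypothetical emails
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)      # sparse matrix of word counts
    print(vectorizer.get_feature_names_out())    # the learned vocabulary
    print(counts.toarray())                      # one row of counts per email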

    Step 3: Split Data into Features and Target

    python
    X = df[['Feature1', 'Feature2', 'Feature3']]  # Features
    y = df['Spam']                                # Target

    Step 4: Train-Test Split

    python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
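
    With only 10 rows, a purely random split can leave one class unrepresented in the test set. If that happens, one option (an optional tweak, not in the original lesson) is to pass stratify=y so both classes keep their proportions across the split:

    python
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )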

    Step 5: Train the Naive Bayes Model

    We’ll use Multinomial Naive Bayes for this example.

    python
    model = MultinomialNB()
    model.fit(X_train, y_train)
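
    By default, MultinomialNB applies additive (Laplace) smoothing with alpha=1.0, so a word count never seen for a class doesn't drive that class's likelihood to zero. You can also set it explicitly:

    python
    model = MultinomialNB(alpha=1.0)  # alpha=1.0 is the default Laplace smoothing
    model.fit(X_train, y_train)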

    Step 6: Make Predictions

    python
    y_pred = model.predict(X_test)
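
    Because Naive Bayes is probabilistic, you can also inspect the posterior probability behind each prediction with predict_proba; a short optional check:

    python
    probs = model.predict_proba(X_test)  # one row per sample: [P(not spam), P(spam)]
    print(probs)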

    Step 7: Evaluate the Model

    Accuracy

    python
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    Output:

    Accuracy: 1.0

    Confusion Matrix

    python
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:n", conf_matrix)

    Output:

    Confusion Matrix:
     [[1 0]
      [0 1]]

    Classification Report

    python
    class_report = classification_report(y_test, y_pred)
    print("Classification Report:n", class_report)

    Output:

    Classification Report:
                   precision    recall  f1-score   support
               0       1.00      1.00      1.00         1
               1       1.00      1.00      1.00         1
        accuracy                           1.00         2
       macro avg       1.00      1.00      1.00         2
    weighted avg       1.00      1.00      1.00         2

    4. Key Evaluation Metrics

    1. Accuracy: Percentage of correct predictions.

    2. Confusion Matrix:

      • True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).

    3. Classification Report:

      • Precision: Ratio of correctly predicted positive observations to total predicted positives.

      • Recall: Ratio of correctly predicted positive observations to all actual positives.

      • F1-Score: Harmonic mean of precision and recall.

      • Support: Number of actual occurrences of each class.
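
    To ground these definitions, here is a small sketch that computes precision, recall, and F1 by hand from hypothetical confusion-matrix counts (the numbers are illustrative only):

    python
    # Hypothetical counts for the positive class
    tp, fp, fn = 8, 2, 1

    precision = tp / (tp + fp)                          # 8/10 = 0.80
    recall = tp / (tp + fn)                             # 8/9  ≈ 0.89
    f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.84
    print(precision, recall, f1)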


    5. Key Takeaways

    • Naive Bayes is simple, fast, and effective for text classification.

    • It assumes feature independence, which may not hold true in all cases.

    • It’s widely used in spam detection, sentiment analysis, and recommendation systems.


    6. Practice Exercise

    1. Experiment with different types of Naive Bayes (Gaussian, Multinomial, Bernoulli) and compare their performance (a starter sketch follows this list).

    2. Apply Naive Bayes to a real-world dataset (e.g., SMS Spam Collection dataset) and evaluate the results.

    3. Explore how feature scaling affects Gaussian Naive Bayes.
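
    As a starting point for exercise 1, here is a minimal sketch that reuses the X_train/X_test split from Step 4 to compare all three variants. GaussianNB suits continuous features and BernoulliNB binarizes its inputs by default, so treat scores on this tiny dataset as illustrative:

    python
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    for Model in (GaussianNB, MultinomialNB, BernoulliNB):
        clf = Model().fit(X_train, y_train)
        print(Model.__name__, accuracy_score(y_test, clf.predict(X_test)))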


    That’s it for Day 8! Tomorrow, we’ll explore Principal Component Analysis (PCA), a dimensionality reduction technique. Keep practicing, and feel free to ask questions in the comments! 🚀
