
    Welcome to Day 6 of the 30 Days of Data Science Series! Today, we’re diving into Support Vector Machines (SVM), a powerful supervised learning algorithm used for classification and regression tasks. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of SVM in Python.


    1. What is Support Vector Machine (SVM)?

    SVM is a supervised learning algorithm that finds the optimal hyperplane to separate data points of different classes in the feature space. The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class (called support vectors).
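
    Formally, for linearly separable data with labels yᵢ ∈ {−1, +1}, the hard-margin SVM solves the standard optimization problem (stated here for reference):

    minimize (1/2)‖w‖²  subject to  yᵢ(wᵀxᵢ + b) ≥ 1 for all i

    where wᵀx + b = 0 is the hyperplane. The margin width works out to 2/‖w‖, so minimizing ‖w‖ maximizes the margin; the C parameter introduced below relaxes these constraints into the soft-margin form used in practice.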

    Key Concepts:

    1. Hyperplane: A decision boundary that separates the classes. For 2D data, it’s a line; for 3D, it’s a plane.

    2. Margin: The distance between the hyperplane and the nearest data points (support vectors). SVM aims to maximize this margin.

    3. Kernel Trick: For non-linear data, SVM uses kernels to transform the input features into a higher-dimensional space where a linear separation is possible. Common kernels include:

      • Linear Kernel: K(x, y) = xᵀy

      • Polynomial Kernel: K(x, y) = (xᵀy + c)ᵈ

      • Radial Basis Function (RBF) Kernel: K(x, y) = exp(−γ‖x − y‖²)

      • Sigmoid Kernel: K(x, y) = tanh(αxᵀy + c)

    4. Regularization Parameter (C): Controls the trade-off between maximizing the margin and minimizing classification errors. A smaller C creates a wider margin but allows more misclassifications, while a larger C reduces misclassifications but may overfit. Both the kernel choice and C are demonstrated in the short sketch below.
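
    To make the kernel choice and C concrete, here is a minimal sketch comparing the four kernels above on scikit-learn's make_moons toy data (the dataset and the fixed C value are illustrative choices, not part of this lesson):

    python
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Toy non-linear data: two interleaving half-moons
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # The kernel and C are just constructor arguments on SVC
    for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
        clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
        print(kernel, clf.score(X_test, y_test))

    On this curved data the RBF kernel usually scores highest, which is the kernel trick at work: the classes become linearly separable only after the implicit feature mapping.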


    2. When to Use SVM?

    • High-dimensional datasets (e.g., text classification, image recognition); a minimal text-classification sketch follows this list.

    • When the number of features is greater than the number of samples.

    • For non-linear data using kernel functions.
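
    As a minimal sketch of the first point (the four-document corpus is made up purely for illustration), TF-IDF turns a handful of sentences into a high-dimensional sparse matrix that a linear SVM handles comfortably:

    python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Tiny made-up corpus: 1 = sports, 0 = tech
    docs = ["the team won the match", "a thrilling final game",
            "new laptop released today", "the chip powers faster phones"]
    labels = [1, 1, 0, 0]

    vec = TfidfVectorizer()
    X_text = vec.fit_transform(docs)       # sparse matrix with more features than samples
    clf = LinearSVC().fit(X_text, labels)
    print(clf.predict(vec.transform(["the game was great"])))  # likely [1] on this toy corpus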


    3. Implementation in Python

    Let’s implement SVM for a classification problem using the Iris dataset.

    Step 1: Import Libraries

    python
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import load_iris

    Step 2: Load and Prepare the Data

    We’ll use the Iris dataset, which contains features like petal length and petal width to classify iris flowers into three species.

    python
    # Load Iris dataset
    iris = load_iris()
    X = iris.data[:, 2:4]  # Using petal length and petal width as features
    y = iris.target        # Target variable (species)

    Step 3: Train-Test Split

    python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    Step 4: Train the SVM Model

    We’ll use the RBF kernel for this example.

    python
    # gamma='scale' (the default) sets gamma to 1 / (n_features * X.var())
    model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
    model.fit(X_train, y_train)

    Step 5: Make Predictions

    python
    y_pred = model.predict(X_test)

    Step 6: Evaluate the Model

    Accuracy

    python
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    Output:

     
    Accuracy: 0.9666666666666667

    Confusion Matrix

    python
    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:n", conf_matrix)

    Output:

     
    Confusion Matrix:
     [[11  0  0]
      [ 0 12  1]
      [ 0  0  6]]

    Classification Report

    python
    class_report = classification_report(y_test, y_pred)
    print("Classification Report:n", class_report)

    Output:

     
    Classification Report:
                   precision    recall  f1-score   support
               0       1.00      1.00      1.00        11
               1       1.00      0.92      0.96        13
               2       0.86      1.00      0.92         6
        accuracy                           0.97        30
       macro avg       0.95      0.97      0.96        30
    weighted avg       0.97      0.97      0.97        30

    Step 7: Visualize the Decision Boundary

    python
    def plot_decision_boundary(X, y, model):
        h = .02  # Step size in the mesh
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=0.8)
    
        sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
        plt.xlabel('Petal Length')
        plt.ylabel('Petal Width')
        plt.title('SVM Decision Boundary')
        plt.show()
    
    plot_decision_boundary(X_test, y_test, model)
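
    The helper simply classifies every point on a dense grid and colours the plane by predicted class, a generic technique that works for any two-feature classifier; the coloured regions meet along the learned decision boundary.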

    4. Key Evaluation Metrics

    1. Accuracy: Percentage of correct predictions.

    2. Confusion Matrix:

      • True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN).

    3. Classification Report:

      • Precision: Ratio of correctly predicted positive observations to total predicted positives.

      • Recall: Ratio of correctly predicted positive observations to all actual positives.

      • F1-Score: Harmonic mean of precision and recall (a worked check against the report above follows this list).

      • Support: Number of actual occurrences of each class.
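
    As a sanity check, class 1's row of the report can be recomputed by hand from the confusion matrix above (a small illustrative sketch):

    python
    # Class 1: 12 predicted correctly, 1 misclassified as class 2,
    # and nothing else was wrongly predicted as class 1
    tp, fn, fp = 12, 1, 0

    precision = tp / (tp + fp)                          # 12/12 = 1.00
    recall = tp / (tp + fn)                             # 12/13 ≈ 0.92
    f1 = 2 * precision * recall / (precision + recall)  # 24/25 = 0.96
    print(round(precision, 2), round(recall, 2), round(f1, 2))  # matches the report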


    5. Key Takeaways

    • SVM finds the optimal hyperplane to separate classes and maximizes the margin.

    • It uses kernel functions to handle non-linear data.

    • It’s effective for high-dimensional datasets but requires careful tuning of hyperparameters like C and kernel parameters.


    6. Practice Exercise

    1. Experiment with different kernels (linear, poly, rbf, sigmoid) and observe their impact on model performance.

    2. Tune the regularization parameter C and gamma to see how they affect the decision boundary; a small grid-search sketch follows this list.

    3. Apply SVM to a real-world dataset (e.g., Breast Cancer dataset) and evaluate the results.
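
    As a starting point for exercises 2 and 3, here is a minimal sketch (the parameter grid is illustrative, not tuned) that grid-searches C and gamma on the scikit-learn Breast Cancer dataset:

    python
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Feature scaling matters for RBF SVMs, so scale inside a pipeline
    pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.01, 0.1]}
    grid = GridSearchCV(pipe, param_grid, cv=5)
    grid.fit(X_train, y_train)

    print("Best params:", grid.best_params_)
    print("CV accuracy:", grid.best_score_)
    print("Test accuracy:", grid.score(X_test, y_test))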


    That’s it for Day 6! Tomorrow, we’ll explore K-Nearest Neighbors (KNN), a simple yet powerful algorithm for classification and regression. Keep practicing, and feel free to ask questions in the comments! 🚀
