Welcome to Day 6 of the 30 Days of Data Science Series! Today, we’re diving into Support Vector Machines (SVM), a powerful supervised learning algorithm used for classification and regression tasks. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of SVM in Python.
1. What is Support Vector Machine (SVM)?
SVM is a supervised learning algorithm that finds the optimal hyperplane to separate data points of different classes in the feature space. The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class (called support vectors).
Key Concepts:
- Hyperplane: A decision boundary that separates the classes. For 2D data, it’s a line; for 3D data, it’s a plane.
- Margin: The distance between the hyperplane and the nearest data points (the support vectors). SVM aims to maximize this margin.
- Kernel Trick: For non-linear data, SVM uses kernels to map the input features into a higher-dimensional space where a linear separation is possible (a small numeric sketch follows this list). Common kernels include:
  - Linear Kernel: K(x, y) = xᵀy
  - Polynomial Kernel: K(x, y) = (xᵀy + c)ᵈ
  - Radial Basis Function (RBF) Kernel: K(x, y) = exp(−γ‖x − y‖²)
  - Sigmoid Kernel: K(x, y) = tanh(αxᵀy + c)
- Regularization Parameter (C): Controls the trade-off between maximizing the margin and minimizing classification errors. A smaller C creates a wider margin but allows more misclassifications, while a larger C reduces misclassifications but may overfit.
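To make the kernel formulas concrete, here is a minimal sketch (using NumPy; the points and the gamma value are arbitrary illustrations) that evaluates the linear and RBF kernels for two feature vectors:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
gamma = 0.5  # illustrative value; in scikit-learn, gamma='scale' derives it from the data

# Linear kernel: a plain dot product
k_linear = x @ y  # 1*2 + 2*0.5 = 3.0

# RBF kernel: similarity decays with the squared distance between the points
k_rbf = np.exp(-gamma * np.sum((x - y) ** 2))

print("Linear kernel:", k_linear)
print("RBF kernel:", round(k_rbf, 4))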
2. When to Use SVM?
- High-dimensional datasets (e.g., text classification, image recognition).
- When the number of features is greater than the number of samples.
- Non-linear data, handled via kernel functions.
3. Implementation in Python
Let’s implement SVM for a classification problem using the Iris dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load and Prepare the Data
We’ll use the Iris dataset, which contains features like petal length and petal width to classify iris flowers into three species.
# Load the Iris dataset
iris = load_iris()
X = iris.data[:, 2:4]  # Use petal length and petal width as features
y = iris.target        # Target variable (species)
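If you want to confirm what was loaded, a quick optional check of the shapes and names looks like this:

print(iris.feature_names[2:4])  # ['petal length (cm)', 'petal width (cm)']
print(iris.target_names)        # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)         # (150, 2) (150,)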
Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
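With only 30 test samples, the class balance of the split can matter. If you want each species represented proportionally in both splits, train_test_split accepts a stratify argument; a minimal variation:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)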
Step 4: Train the SVM Model
We’ll use the RBF kernel for this example.
model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
model.fit(X_train, y_train)
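The two petal measurements are on similar scales here, but RBF SVMs are generally sensitive to feature scale. When features differ in magnitude, a common pattern is to standardize them before fitting; a minimal sketch using a scikit-learn pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale features to zero mean / unit variance before fitting the SVM
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)),
])
scaled_model.fit(X_train, y_train)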
Step 5: Make Predictions
y_pred = model.predict(X_test)
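If you need confidence scores rather than hard labels, a fitted SVC also exposes decision_function (signed distances to the decision boundaries), for example:

scores = model.decision_function(X_test)
print(scores.shape)  # (30, 3): one column per class with the default one-vs-rest shape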
Step 6: Evaluate the Model
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.9666666666666667
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
Step 7: Visualize the Decision Boundary
def plot_decision_boundary(X, y, model):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Petal Length')
    plt.ylabel('Petal Width')
    plt.title('SVM Decision Boundary')
    plt.show()

plot_decision_boundary(X_test, y_test, model)
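The fitted model also exposes the points that actually define the boundary; if you are curious how many support vectors were selected, a quick check:

print("Support vectors per class:", model.n_support_)
print("Total support vectors:", model.support_vectors_.shape[0])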
4. Key Evaluation Metrics
- Accuracy: Percentage of correct predictions.
- Confusion Matrix: Breaks predictions down into True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
- Classification Report (a worked example from the confusion matrix above follows this list):
  - Precision: Ratio of correctly predicted positive observations to total predicted positives.
  - Recall: Ratio of correctly predicted positive observations to all actual positives.
  - F1-Score: Harmonic mean of precision and recall.
  - Support: Number of actual occurrences of each class.
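As a worked example, take class 2 in the confusion matrix above: 6 of its samples were predicted correctly and 1 sample from class 1 was wrongly predicted as class 2, so precision = 6 / (6 + 1) ≈ 0.86, recall = 6 / 6 = 1.00, and F1 = 2 × 0.86 × 1.00 / (0.86 + 1.00) ≈ 0.92, matching the classification report. The same per-class numbers can be pulled out programmatically:

from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred)
print("Class 2 precision/recall/F1:", precision[2], recall[2], f1[2])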
5. Key Takeaways
- SVM finds the optimal hyperplane that separates classes by maximizing the margin.
- It uses kernel functions to handle non-linear data.
- It’s effective for high-dimensional datasets but requires careful tuning of hyperparameters such as C and the kernel parameters.
6. Practice Exercise
- Experiment with different kernels (linear, poly, rbf, sigmoid) and observe their impact on model performance.
- Tune the regularization parameter C and gamma to see how they affect the decision boundary (see the grid-search sketch after this list).
- Apply SVM to a real-world dataset (e.g., the Breast Cancer dataset) and evaluate the results.
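One way to start the tuning exercise is a cross-validated grid search over C and gamma; a minimal sketch with GridSearchCV (the grid values are arbitrary examples, not recommendations):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],           # illustrative values only
    'gamma': [0.01, 0.1, 1, 'scale'],
}
grid = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)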
That’s it for Day 6! Tomorrow, we’ll explore K-Nearest Neighbors (KNN), a simple yet powerful algorithm for classification and regression. Keep practicing, and feel free to ask questions in the comments! 🚀