Welcome to Day 6 of the 30 Days of Data Science Series! Today, we’re diving into Support Vector Machines (SVM), a powerful supervised learning algorithm used for classification and regression tasks. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of SVM in Python.
1. What is Support Vector Machine (SVM)?
SVM is a supervised learning algorithm that finds the optimal hyperplane to separate data points of different classes in the feature space. The goal is to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class (called support vectors).
Key Concepts:
- Hyperplane: A decision boundary that separates the classes. For 2D data, it’s a line; for 3D data, it’s a plane.
- Margin: The distance between the hyperplane and the nearest data points (the support vectors). SVM aims to maximize this margin.
- Kernel Trick: For non-linear data, SVM uses kernels to map the input features into a higher-dimensional space where a linear separation is possible (a small numeric sketch follows this list). Common kernels include:
  - Linear Kernel: K(x, y) = xᵀy
  - Polynomial Kernel: K(x, y) = (xᵀy + c)ᵈ
  - Radial Basis Function (RBF) Kernel: K(x, y) = exp(−γ‖x − y‖²)
  - Sigmoid Kernel: K(x, y) = tanh(αxᵀy + c)
- Regularization Parameter (C): Controls the trade-off between maximizing the margin and minimizing classification errors. A smaller C creates a wider margin but allows more misclassifications, while a larger C reduces misclassifications but may overfit.
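To make the kernel formulas concrete, here is a minimal sketch (using NumPy; the points and the gamma value are arbitrary illustrations) that evaluates the linear and RBF kernels for two feature vectors:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
gamma = 0.5  # illustrative value; in scikit-learn, gamma='scale' derives it from the data

# Linear kernel: a plain dot product
k_linear = x @ y  # 1*2 + 2*0.5 = 3.0

# RBF kernel: similarity decays with the squared distance between the points
k_rbf = np.exp(-gamma * np.sum((x - y) ** 2))

print("Linear kernel:", k_linear)
print("RBF kernel:", round(k_rbf, 4))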
2. When to Use SVM?
- High-dimensional datasets (e.g., text classification, image recognition).
- When the number of features is greater than the number of samples.
- Non-linear data, handled via kernel functions.
3. Implementation in Python
Let’s implement SVM for a classification problem using the Iris dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load and Prepare the Data
We’ll use the Iris dataset, which contains features like petal length and petal width to classify iris flowers into three species.
# Load the Iris dataset
iris = load_iris()
X = iris.data[:, 2:4]  # Use petal length and petal width as features
y = iris.target        # Target variable (species)
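If you want to confirm what was loaded, a quick optional check of the shapes and names looks like this:

print(iris.feature_names[2:4])  # ['petal length (cm)', 'petal width (cm)']
print(iris.target_names)        # ['setosa' 'versicolor' 'virginica']
print(X.shape, y.shape)         # (150, 2) (150,)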
Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
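With only 30 test samples, the class balance of the split can matter. If you want each species represented proportionally in both splits, train_test_split accepts a stratify argument; a minimal variation:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)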
Step 4: Train the SVM Model
We’ll use the RBF kernel for this example.
model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
model.fit(X_train, y_train)
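The two petal measurements are on similar scales here, but RBF SVMs are generally sensitive to feature scale. When features differ in magnitude, a common pattern is to standardize them before fitting; a minimal sketch using a scikit-learn pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale features to zero mean / unit variance before fitting the SVM
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)),
])
scaled_model.fit(X_train, y_train)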
Step 5: Make Predictions
y_pred = model.predict(X_test)
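If you need confidence scores rather than hard labels, a fitted SVC also exposes decision_function (signed distances to the decision boundaries), for example:

scores = model.decision_function(X_test)
print(scores.shape)  # (30, 3): one column per class with the default one-vs-rest shape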
Step 6: Evaluate the Model
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.9666666666666667
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
Step 7: Visualize the Decision Boundary
def plot_decision_boundary(X, y, model):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Petal Length')
    plt.ylabel('Petal Width')
    plt.title('SVM Decision Boundary')
    plt.show()

plot_decision_boundary(X_test, y_test, model)
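The fitted model also exposes the points that actually define the boundary; if you are curious how many support vectors were selected, a quick check:

print("Support vectors per class:", model.n_support_)
print("Total support vectors:", model.support_vectors_.shape[0])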
4. Key Evaluation Metrics
- Accuracy: Percentage of correct predictions.
- Confusion Matrix: Breaks predictions down into True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
- Classification Report (a worked example from the confusion matrix above follows this list):
  - Precision: Ratio of correctly predicted positive observations to total predicted positives.
  - Recall: Ratio of correctly predicted positive observations to all actual positives.
  - F1-Score: Harmonic mean of precision and recall.
  - Support: Number of actual occurrences of each class.
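As a worked example, take class 2 in the confusion matrix above: 6 of its samples were predicted correctly and 1 sample from class 1 was wrongly predicted as class 2, so precision = 6 / (6 + 1) ≈ 0.86, recall = 6 / 6 = 1.00, and F1 = 2 × 0.86 × 1.00 / (0.86 + 1.00) ≈ 0.92, matching the classification report. The same per-class numbers can be pulled out programmatically:

from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred)
print("Class 2 precision/recall/F1:", precision[2], recall[2], f1[2])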
5. Key Takeaways
- SVM finds the optimal hyperplane that separates classes by maximizing the margin.
- It uses kernel functions to handle non-linear data.
- It’s effective for high-dimensional datasets but requires careful tuning of hyperparameters such as C and the kernel parameters.
6. Practice Exercise
- Experiment with different kernels (linear, poly, rbf, sigmoid) and observe their impact on model performance.
- Tune the regularization parameter C and gamma to see how they affect the decision boundary (see the grid-search sketch after this list).
- Apply SVM to a real-world dataset (e.g., the Breast Cancer dataset) and evaluate the results.
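One way to start the tuning exercise is a cross-validated grid search over C and gamma; a minimal sketch with GridSearchCV (the grid values are arbitrary examples, not recommendations):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],           # illustrative values only
    'gamma': [0.01, 0.1, 1, 'scale'],
}
grid = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)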
That’s it for Day 6! Tomorrow, we’ll explore K-Nearest Neighbors (KNN), a simple yet powerful algorithm for classification and regression. Keep practicing, and feel free to ask questions in the comments! 🚀