Welcome to Day 7 of the 30 Days of Data Science Series! Today, we’re exploring K-Nearest Neighbors (KNN), a simple yet powerful algorithm used for both classification and regression tasks. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of KNN in Python.
1. What is K-Nearest Neighbors (KNN)?
KNN is a non-parametric, instance-based learning algorithm that predicts the value or class of a new sample based on the k closest samples (neighbors) in the training dataset. It’s a lazy learner because it doesn’t learn a model during training; instead, it stores the entire dataset and makes predictions at runtime.
Key Concepts:
- Distance Metric: Measures the similarity between data points. Common metrics include:
  - Euclidean Distance: $\sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
  - Manhattan Distance: $\sum_{i=1}^{n}|x_i - y_i|$
  - Minkowski Distance: $\left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$, a generalization of the Euclidean (p = 2) and Manhattan (p = 1) distances.
- Choosing k: The number of neighbors (k) is a hyperparameter that significantly impacts the model's performance:
  - Smaller k: More flexible decision boundaries, but sensitive to noise and outliers (risk of overfitting).
  - Larger k: Smoother decision boundaries, but may underfit.
- Prediction (see the from-scratch sketch after this list):
  - For classification: The predicted class is the majority class among the k nearest neighbors.
  - For regression: The predicted value is the average (or weighted average) of the target values of the k nearest neighbors.
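To make these ideas concrete, here is a minimal from-scratch sketch of KNN classification (Euclidean distance plus majority vote). The function names and the tiny toy dataset are made up purely for illustration; later in this lesson we use scikit-learn's KNeighborsClassifier instead.

import numpy as np
from collections import Counter

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X_train, y_train, x_new, k=3, distance=euclidean):
    # Distance from the new point to every stored training point
    distances = [distance(x, x_new) for x in X_train]
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Classification: majority vote among the k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up example: two clusters of points labeled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.1, 0.9]), k=3))  # prints 0

For regression, the final step would simply average the neighbors' target values instead of taking a majority vote.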
2. When to Use KNN?
- Small to medium-sized datasets.
- When interpretability is important (KNN is easy to understand).
- Datasets with non-linear relationships.
3. Implementation in Python
Let’s implement KNN for a classification problem using the Iris dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load and Prepare the Data
We’ll use the Iris dataset and keep just two of its features, sepal length and sepal width, to classify iris flowers into three species.
# Load Iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Using sepal length and sepal width as features
y = iris.target       # Target variable (species)
Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 4: Train the KNN Model
We’ll use k=5 for this example.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
Step 5: Make Predictions
y_pred = model.predict(X_test)
Step 6: Evaluate the Model
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.9333333333333333
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[11  0  0]
 [ 0 12  1]
 [ 0  1  5]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.92      0.92      0.92        13
           2       0.83      0.83      0.83         6

    accuracy                           0.93        30
   macro avg       0.92      0.92      0.92        30
weighted avg       0.93      0.93      0.93        30
Step 7: Visualize the Decision Boundary
def plot_decision_boundary(X, y, model):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Sepal Length')
    plt.ylabel('Sepal Width')
    plt.title('KNN Decision Boundary (k=5)')
    plt.show()

plot_decision_boundary(X_test, y_test, model)
4. Key Evaluation Metrics
- Accuracy: Percentage of correct predictions.
- Confusion Matrix: Counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for each class.
- Classification Report (see the worked example after this list):
  - Precision: Ratio of correctly predicted positive observations to total predicted positives.
  - Recall: Ratio of correctly predicted positive observations to all actual positives.
  - F1-Score: Harmonic mean of precision and recall.
  - Support: Number of actual occurrences of each class.
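To connect these definitions to the numbers above, here is a short sketch that recovers precision, recall, and F1 for class 2 by hand from the confusion matrix printed in Step 6 (rows are actual classes, columns are predicted classes in scikit-learn's convention):

import numpy as np

# Confusion matrix from Step 6
cm = np.array([[11, 0, 0],
               [0, 12, 1],
               [0, 1, 5]])

cls = 2
tp = cm[cls, cls]              # class 2 samples correctly predicted as class 2
fn = cm[cls, :].sum() - tp     # class 2 samples predicted as another class
fp = cm[:, cls].sum() - tp     # other classes wrongly predicted as class 2

precision = tp / (tp + fp)     # 5 / 6 ≈ 0.83
recall = tp / (tp + fn)        # 5 / 6 ≈ 0.83
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.83
print(precision, recall, f1)

These values match the class 2 row of the classification report above.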
5. Key Takeaways
- KNN is simple and intuitive, and it has no explicit training phase (it is a lazy learner).
- It is sensitive to the choice of k and the distance metric.
- It is computationally expensive at prediction time for large datasets, since every query is compared against the stored training data (see the note after this list).
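On the last point, one common mitigation is to let scikit-learn build a spatial index over the training data instead of using brute-force search. A brief sketch, reusing X_train, y_train, X_test, and y_test from Step 3 and the algorithm parameter of KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

# Same model as in Step 4, but with a ball-tree index to speed up neighbor lookups
model_fast = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
model_fast.fit(X_train, y_train)
print(model_fast.score(X_test, y_test))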
6. Practice Exercise
- Experiment with different values of k (e.g., 3, 7, 10) and observe how it affects the model's performance (a starting sketch follows this list).
- Try different distance metrics (e.g., Manhattan, Minkowski) and compare the results.
- Apply KNN to a real-world dataset (e.g., the Breast Cancer dataset) and evaluate the results.
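As a starting point for the first two exercises, here is a small sketch (reusing the train/test split from Step 3) that loops over several values of k and a few distance metrics:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Compare test accuracy for several neighbor counts and distance metrics
for k in [3, 5, 7, 10]:
    for metric in ['euclidean', 'manhattan', 'minkowski']:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"k={k}, metric={metric}: accuracy={acc:.3f}")

Note that with scikit-learn's default p=2, the Minkowski metric is identical to the Euclidean one, so vary the p parameter as well to see a difference.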
7. Additional Resources
That’s it for Day 7! Tomorrow, we’ll explore Naive Bayes, a probabilistic algorithm for classification tasks. Keep practicing, and feel free to ask questions in the comments! 🚀