Welcome to Day 7 of the 30 Days of Data Science Series! Today, we’re exploring K-Nearest Neighbors (KNN), a simple yet powerful algorithm used for both classification and regression tasks. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of KNN in Python.
1. What is K-Nearest Neighbors (KNN)?
KNN is a non-parametric, instance-based learning algorithm that predicts the value or class of a new sample based on the k closest samples (neighbors) in the training dataset. It’s a lazy learner because it doesn’t learn a model during training; instead, it stores the entire dataset and makes predictions at runtime.
Key Concepts:
- Distance Metric: Measures the similarity between data points. Common metrics include:
  - Euclidean Distance: $\sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
  - Manhattan Distance: $\sum_{i=1}^{n}|x_i - y_i|$
  - Minkowski Distance: $\left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$, a generalization of the Euclidean (p = 2) and Manhattan (p = 1) distances.
- Choosing k: The number of neighbors (k) is a hyperparameter that significantly impacts the model's performance:
  - Smaller k: More flexible decision boundaries, but sensitive to noise and outliers (risk of overfitting).
  - Larger k: Smoother decision boundaries, but may underfit.
- Prediction (see the from-scratch sketch after this list):
  - For classification: The predicted class is the majority class among the k nearest neighbors.
  - For regression: The predicted value is the average (or weighted average) of the target values of the k nearest neighbors.
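To make these ideas concrete, here is a minimal from-scratch sketch of KNN classification (Euclidean distance plus majority vote). The function names and the tiny toy dataset are made up purely for illustration; later in this lesson we use scikit-learn's KNeighborsClassifier instead.

import numpy as np
from collections import Counter

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X_train, y_train, x_new, k=3, distance=euclidean):
    # Distance from the new point to every stored training point
    distances = [distance(x, x_new) for x in X_train]
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Classification: majority vote among the k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up example: two clusters of points labeled 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.1, 0.9]), k=3))  # prints 0

For regression, the final step would simply average the neighbors' target values instead of taking a majority vote.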
2. When to Use KNN?
- Small to medium-sized datasets.
- When interpretability is important (KNN is easy to understand).
- Datasets with non-linear relationships.
3. Implementation in Python
Let’s implement KNN for a classification problem using the Iris dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load and Prepare the Data
We’ll use the Iris dataset and keep just two of its features, sepal length and sepal width, to classify iris flowers into three species.
# Load Iris dataset
iris = load_iris()
X = iris.data[:, :2]  # Using sepal length and sepal width as features
y = iris.target       # Target variable (species)
Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 4: Train the KNN Model
We’ll use k=5 for this example.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
Step 5: Make Predictions
y_pred = model.predict(X_test)
Step 6: Evaluate the Model
Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output:
Accuracy: 0.9333333333333333
Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
Output:
Confusion Matrix:
 [[11  0  0]
 [ 0 12  1]
 [ 0  1  5]]
Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:\n", class_report)
Output:
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.92      0.92      0.92        13
           2       0.83      0.83      0.83         6

    accuracy                           0.93        30
   macro avg       0.92      0.92      0.92        30
weighted avg       0.93      0.93      0.93        30
Step 7: Visualize the Decision Boundary
def plot_decision_boundary(X, y, model):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Sepal Length')
    plt.ylabel('Sepal Width')
    plt.title('KNN Decision Boundary (k=5)')
    plt.show()

plot_decision_boundary(X_test, y_test, model)
4. Key Evaluation Metrics
- Accuracy: Percentage of correct predictions.
- Confusion Matrix: Counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for each class.
- Classification Report (see the worked example after this list):
  - Precision: Ratio of correctly predicted positive observations to total predicted positives.
  - Recall: Ratio of correctly predicted positive observations to all actual positives.
  - F1-Score: Harmonic mean of precision and recall.
  - Support: Number of actual occurrences of each class.
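To connect these definitions to the numbers above, here is a short sketch that recovers precision, recall, and F1 for class 2 by hand from the confusion matrix printed in Step 6 (rows are actual classes, columns are predicted classes in scikit-learn's convention):

import numpy as np

# Confusion matrix from Step 6
cm = np.array([[11, 0, 0],
               [0, 12, 1],
               [0, 1, 5]])

cls = 2
tp = cm[cls, cls]              # class 2 samples correctly predicted as class 2
fn = cm[cls, :].sum() - tp     # class 2 samples predicted as another class
fp = cm[:, cls].sum() - tp     # other classes wrongly predicted as class 2

precision = tp / (tp + fp)     # 5 / 6 ≈ 0.83
recall = tp / (tp + fn)        # 5 / 6 ≈ 0.83
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.83
print(precision, recall, f1)

These values match the class 2 row of the classification report above.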
5. Key Takeaways
- KNN is simple and intuitive, and it has no explicit training phase (it is a lazy learner).
- It is sensitive to the choice of k and the distance metric.
- It is computationally expensive at prediction time for large datasets, since every query is compared against the stored training data (see the note after this list).
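On the last point, one common mitigation is to let scikit-learn build a spatial index over the training data instead of using brute-force search. A brief sketch, reusing X_train, y_train, X_test, and y_test from Step 3 and the algorithm parameter of KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

# Same model as in Step 4, but with a ball-tree index to speed up neighbor lookups
model_fast = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree')
model_fast.fit(X_train, y_train)
print(model_fast.score(X_test, y_test))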
6. Practice Exercise
- Experiment with different values of k (e.g., 3, 7, 10) and observe how it affects the model's performance (a starting sketch follows this list).
- Try different distance metrics (e.g., Manhattan, Minkowski) and compare the results.
- Apply KNN to a real-world dataset (e.g., the Breast Cancer dataset) and evaluate the results.
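As a starting point for the first two exercises, here is a small sketch (reusing the train/test split from Step 3) that loops over several values of k and a few distance metrics:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Compare test accuracy for several neighbor counts and distance metrics
for k in [3, 5, 7, 10]:
    for metric in ['euclidean', 'manhattan', 'minkowski']:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"k={k}, metric={metric}: accuracy={acc:.3f}")

Note that with scikit-learn's default p=2, the Minkowski metric is identical to the Euclidean one, so vary the p parameter as well to see a difference.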
7. Additional Resources
That’s it for Day 7! Tomorrow, we’ll explore Naive Bayes, a probabilistic algorithm for classification tasks. Keep practicing, and feel free to ask questions in the comments! 🚀