Welcome to Day 10 of the 30 Days of Data Science Series! Today, we’re diving into k-Means Clustering, a popular unsupervised learning algorithm used for grouping data into clusters. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of k-Means in Python.
1. What is k-Means Clustering?
k-Means is an unsupervised learning algorithm used to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean (centroid). The algorithm aims to minimize the within-cluster variance, making the clusters as compact as possible.
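Formally, k-Means searches for centroids \mu_1, \dots, \mu_k that minimize the within-cluster sum of squares (the same WCSS quantity we plot later in this lesson):

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points currently assigned to cluster i and \mu_i is that cluster's centroid.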
Key Concepts:
Centroids: The center of each cluster, calculated as the mean of all points in the cluster.
Assignment Step: Each data point is assigned to the nearest centroid.
Update Step: Centroids are recalculated based on the current assignment of points.
Iteration: The algorithm repeats the assignment and update steps until convergence (centroids stop changing significantly).
Steps in k-Means:
1. Initialization: Randomly select k initial centroids.
2. Assignment: Assign each data point to the nearest centroid.
3. Update: Recalculate the centroids as the mean of all points in the cluster.
4. Repeat: Repeat steps 2 and 3 until convergence (a minimal from-scratch sketch of this loop follows the list).
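To make the assignment and update steps concrete, here is a minimal from-scratch sketch in NumPy. It is for illustration only (it uses plain random initialization and does not handle the edge case of a cluster losing all its points); in practice, use scikit-learn's KMeans as we do below.

import numpy as np

def kmeans_naive(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2 (Assignment): label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (Update): recompute each centroid as the mean of its points
        # (note: this line would fail if a cluster ended up empty)
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4 (Repeat): stop once the centroids stop moving
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return labels, centroids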
2. When to Use k-Means?
When you have unlabeled data and want to discover natural groupings.
For datasets with spherical or well-separated clusters.
Applications include market segmentation, image compression, and anomaly detection.
3. Implementation in Python
Let’s implement k-Means clustering on a synthetic dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Prepare the Data
We’ll generate a synthetic dataset with three clusters.
# Generate synthetic data
np.random.seed(0)
X = np.vstack((
    np.random.normal(0, 1, (100, 2)),
    np.random.normal(5, 1, (100, 2)),
    np.random.normal(-5, 1, (100, 2)),
))
Step 3: Apply k-Means Clustering
We’ll use k=3 clusters for this example.
k = 3
# n_init=10 makes the number of restarts explicit (recent scikit-learn versions warn otherwise)
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
y_kmeans = kmeans.fit_predict(X)
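Here, fit_predict returns one cluster label per row of X. The fitted model also exposes the learned centroids and the final WCSS (called inertia in scikit-learn), both of which we use below:

print(kmeans.cluster_centers_)  # learned centroids, shape (3, 2)
print(kmeans.inertia_)          # within-cluster sum of squares (WCSS)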
Step 4: Visualize the Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering (k=3)')
plt.legend()
plt.show()
4. Choosing the Number of Clusters (k)
Selecting the right number of clusters is crucial. Two common methods are:
Elbow Method
The Elbow Method plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters. The “elbow” point (where the rate of decrease slows) indicates the optimal k.
# Elbow Method: compute WCSS for k = 1..10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()
Silhouette Score
The Silhouette Score measures how similar an object is to its own cluster compared to other clusters. Higher scores indicate better-defined clusters.
from sklearn.metrics import silhouette_score

# Silhouette Score for k = 2..10 (the score is undefined for k = 1)
silhouette_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init=10)
    y_kmeans = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, y_kmeans))

plt.figure(figsize=(8, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score')
plt.show()
5. Key Evaluation Metrics
Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters. Lower WCSS indicates tighter clusters.
Silhouette Score: Ranges from -1 to 1. Higher values indicate better-defined clusters.
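For a single point i, the silhouette value compares a(i), the mean distance to the other points in its own cluster, with b(i), the mean distance to the points of the nearest other cluster:

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

The Silhouette Score reported by silhouette_score is the mean of s(i) over all points.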
6. Key Takeaways
k-Means is simple, efficient, and works well for spherical clusters.
The choice of k is critical and can be determined using the Elbow Method or Silhouette Score.
It’s sensitive to the initial placement of centroids and may struggle with non-spherical or overlapping clusters; the short experiment below illustrates the initialization issue.
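The initialization issue is easy to probe: compare a single run from purely random starting centroids against scikit-learn's k-means++ seeding with several restarts. (The seed 42 below is arbitrary; on an easy dataset like ours the two runs may coincide, but on harder data the single random run often lands in a worse local optimum, i.e. higher WCSS.)

# One run from purely random starting centroids
single_run = KMeans(n_clusters=3, init='random', n_init=1, random_state=42).fit(X)

# k-means++ seeding, keeping the best of 10 restarts
restarts = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)

print('random init, 1 run :', single_run.inertia_)
print('k-means++, 10 runs :', restarts.inertia_)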
7. Applications of k-Means
Market Segmentation: Grouping customers based on behavior.
Image Compression: Reducing the number of colors in an image (a color-quantization sketch follows this list).
Anomaly Detection: Identifying outliers in datasets.
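To illustrate the image-compression use case, the sketch below quantizes an image to 8 colors by clustering its pixels. The random array is only a stand-in so the snippet runs on its own; with a real image you would load a pixel array instead (for example via matplotlib.pyplot.imread).

import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(64, 64, 3)      # stand-in RGB image; replace with a real one
pixels = image.reshape(-1, 3)          # one row per pixel

# Cluster pixel colors into 8 groups
quantizer = KMeans(n_clusters=8, random_state=0, n_init=10).fit(pixels)

# Replace every pixel with its cluster's centroid color
compressed = quantizer.cluster_centers_[quantizer.labels_].reshape(image.shape)
print(compressed.shape)  # same shape as the input, but only 8 distinct colors remain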
8. Practice Exercise
Experiment with different values of k and observe how it affects the clustering results.
Apply k-Means to a real-world dataset (e.g., Mall Customer Segmentation) and evaluate the clusters.
Compare k-Means with other clustering algorithms like DBSCAN or Hierarchical Clustering (a starter snippet for DBSCAN follows below).
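As a starting point for the DBSCAN comparison, the snippet below runs it on the same X from this lesson. The eps and min_samples values are rough guesses you will need to tune per dataset:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.8, min_samples=5)    # hyperparameters to tune; no k required
labels = db.fit_predict(X)             # label -1 marks points treated as noise

print('clusters found:', len(set(labels) - {-1}))
print('noise points  :', int((labels == -1).sum()))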
That’s it for Day 10! Tomorrow, we’ll explore Hierarchical Clustering, another powerful unsupervised learning algorithm. Keep practicing, and feel free to ask questions in the comments! 🚀