Welcome to Day 10 of the 30 Days of Data Science Series! Today, we’re diving into k-Means Clustering, a popular unsupervised learning algorithm used for grouping data into clusters. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of k-Means in Python.
1. What is k-Means Clustering?
k-Means is an unsupervised learning algorithm used to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean (centroid). The algorithm aims to minimize the within-cluster variance, making the clusters as compact as possible.
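Formally, k-Means searches for centroids \mu_1, \dots, \mu_k that minimize the within-cluster sum of squares (the same WCSS quantity we plot later in this lesson):

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of points currently assigned to cluster i and \mu_i is that cluster's centroid.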
Key Concepts:
Centroids: The center of each cluster, calculated as the mean of all points in the cluster.
Assignment Step: Each data point is assigned to the nearest centroid.
Update Step: Centroids are recalculated based on the current assignment of points.
Iteration: The algorithm repeats the assignment and update steps until convergence (centroids stop changing significantly).
Steps in k-Means:
1. Initialization: Randomly select k initial centroids.
2. Assignment: Assign each data point to the nearest centroid.
3. Update: Recalculate the centroids as the mean of all points in the cluster.
4. Repeat: Repeat steps 2 and 3 until convergence (a minimal from-scratch sketch of this loop follows the list).
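To make the assignment and update steps concrete, here is a minimal from-scratch sketch in NumPy. It is for illustration only (it uses plain random initialization and does not handle the edge case of a cluster losing all its points); in practice, use scikit-learn's KMeans as we do below.

import numpy as np

def kmeans_naive(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (Initialization): pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2 (Assignment): label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (Update): recompute each centroid as the mean of its points
        # (note: this line would fail if a cluster ended up empty)
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4 (Repeat): stop once the centroids stop moving
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return labels, centroids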
2. When to Use k-Means?
When you have unlabeled data and want to discover natural groupings.
For datasets with spherical or well-separated clusters.
Applications include market segmentation, image compression, and anomaly detection.
3. Implementation in Python
Let’s implement k-Means clustering on a synthetic dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Prepare the Data
We’ll generate a synthetic dataset with three clusters.
# Generate synthetic data
np.random.seed(0)
X = np.vstack((
    np.random.normal(0, 1, (100, 2)),
    np.random.normal(5, 1, (100, 2)),
    np.random.normal(-5, 1, (100, 2)),
))
Step 3: Apply k-Means Clustering
We’ll use k=3 clusters for this example.
k = 3
# n_init=10 makes the number of restarts explicit (recent scikit-learn versions warn otherwise)
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
y_kmeans = kmeans.fit_predict(X)
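Here, fit_predict returns one cluster label per row of X. The fitted model also exposes the learned centroids and the final WCSS (called inertia in scikit-learn), both of which we use below:

print(kmeans.cluster_centers_)  # learned centroids, shape (3, 2)
print(kmeans.inertia_)          # within-cluster sum of squares (WCSS)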
Step 4: Visualize the Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering (k=3)')
plt.legend()
plt.show()
4. Choosing the Number of Clusters (k)
Selecting the right number of clusters is crucial. Two common methods are:
Elbow Method
The Elbow Method plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters. The “elbow” point (where the rate of decrease slows) indicates the optimal k.
# Elbow Method: compute WCSS for k = 1..10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()
Silhouette Score
The Silhouette Score measures how similar an object is to its own cluster compared to other clusters. Higher scores indicate better-defined clusters.
from sklearn.metrics import silhouette_score

# Silhouette Score for k = 2..10 (the score is undefined for k = 1)
silhouette_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, random_state=0, n_init=10)
    y_kmeans = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, y_kmeans))

plt.figure(figsize=(8, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score')
plt.show()
5. Key Evaluation Metrics
Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters. Lower WCSS indicates tighter clusters.
Silhouette Score: Ranges from -1 to 1. Higher values indicate better-defined clusters.
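For a single point i, the silhouette value compares a(i), the mean distance to the other points in its own cluster, with b(i), the mean distance to the points of the nearest other cluster:

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

The Silhouette Score reported by silhouette_score is the mean of s(i) over all points.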
6. Key Takeaways
k-Means is simple, efficient, and works well for spherical clusters.
The choice of k is critical and can be determined using the Elbow Method or Silhouette Score.
It’s sensitive to the initial placement of centroids and may struggle with non-spherical or overlapping clusters; the short experiment below illustrates the initialization issue.
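The initialization issue is easy to probe: compare a single run from purely random starting centroids against scikit-learn's k-means++ seeding with several restarts. (The seed 42 below is arbitrary; on an easy dataset like ours the two runs may coincide, but on harder data the single random run often lands in a worse local optimum, i.e. higher WCSS.)

# One run from purely random starting centroids
single_run = KMeans(n_clusters=3, init='random', n_init=1, random_state=42).fit(X)

# k-means++ seeding, keeping the best of 10 restarts
restarts = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)

print('random init, 1 run :', single_run.inertia_)
print('k-means++, 10 runs :', restarts.inertia_)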
7. Applications of k-Means
Market Segmentation: Grouping customers based on behavior.
Image Compression: Reducing the number of colors in an image (a color-quantization sketch follows this list).
Anomaly Detection: Identifying outliers in datasets.
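To illustrate the image-compression use case, the sketch below quantizes an image to 8 colors by clustering its pixels. The random array is only a stand-in so the snippet runs on its own; with a real image you would load a pixel array instead (for example via matplotlib.pyplot.imread).

import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(64, 64, 3)      # stand-in RGB image; replace with a real one
pixels = image.reshape(-1, 3)          # one row per pixel

# Cluster pixel colors into 8 groups
quantizer = KMeans(n_clusters=8, random_state=0, n_init=10).fit(pixels)

# Replace every pixel with its cluster's centroid color
compressed = quantizer.cluster_centers_[quantizer.labels_].reshape(image.shape)
print(compressed.shape)  # same shape as the input, but only 8 distinct colors remain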
8. Practice Exercise
Experiment with different values of k and observe how it affects the clustering results.
Apply k-Means to a real-world dataset (e.g., Mall Customer Segmentation) and evaluate the clusters.
Compare k-Means with other clustering algorithms like DBSCAN or Hierarchical Clustering (a starter snippet for DBSCAN follows below).
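As a starting point for the DBSCAN comparison, the snippet below runs it on the same X from this lesson. The eps and min_samples values are rough guesses you will need to tune per dataset:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.8, min_samples=5)    # hyperparameters to tune; no k required
labels = db.fit_predict(X)             # label -1 marks points treated as noise

print('clusters found:', len(set(labels) - {-1}))
print('noise points  :', int((labels == -1).sum()))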
That’s it for Day 10! Tomorrow, we’ll explore Hierarchical Clustering, another powerful unsupervised learning algorithm. Keep practicing, and feel free to ask questions in the comments! 🚀