
    Welcome to Day 10 of the 30 Days of Data Science Series! Today, we’re diving into k-Means Clustering, a popular unsupervised learning algorithm used for grouping data into clusters. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of k-Means in Python.


    1. What is k-Means Clustering?

    k-Means is an unsupervised learning algorithm used to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest mean (centroid). The algorithm aims to minimize the within-cluster variance, making the clusters as compact as possible.
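
    Formally, if the k clusters are C_1, …, C_k with centroids \mu_1, …, \mu_k, k-Means minimizes the within-cluster sum of squares (WCSS):

    J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

    Each point contributes the squared distance to its own cluster's centroid, so a smaller J means more compact clusters.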

    Key Concepts:

    1. Centroids: The center of each cluster, calculated as the mean of all points in the cluster.

    2. Assignment Step: Each data point is assigned to the nearest centroid.

    3. Update Step: Centroids are recalculated based on the current assignment of points.

    4. Iteration: The algorithm repeats the assignment and update steps until convergence (centroids stop changing significantly).

    Steps in k-Means:

    1. Initialization: Randomly select k initial centroids.

    2. Assignment: Assign each data point to the nearest centroid.

    3. Update: Recalculate the centroids as the mean of all points in the cluster.

    4. Repeat: Repeat steps 2 and 3 until convergence (a minimal NumPy sketch of this loop follows below).
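
    To make the loop concrete, here is a minimal from-scratch sketch of these four steps in NumPy. It is illustrative only (for instance, it assumes no cluster ever becomes empty); in practice you would use scikit-learn's KMeans, as we do next.

    python
    import numpy as np

    def kmeans_from_scratch(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1 (initialization): pick k distinct data points as starting centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Step 2 (assignment): label each point with its nearest centroid
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Step 3 (update): move each centroid to the mean of its assigned points
            # (assumes no cluster ends up empty)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 4 (repeat): stop once the centroids no longer move
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids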


    2. When to Use k-Means?

    • When you have unlabeled data and want to discover natural groupings.

    • For datasets with spherical or well-separated clusters.

    • Applications include market segmentation, image compression, and anomaly detection.


    3. Implementation in Python

    Let’s implement k-Means clustering on a synthetic dataset.

    Step 1: Import Libraries

    python
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    import seaborn as sns

    Step 2: Prepare the Data

    We’ll generate a synthetic dataset with three clusters.

    python
    # Generate synthetic data
    np.random.seed(0)
    X = np.vstack((np.random.normal(0, 1, (100, 2)),
                   np.random.normal(5, 1, (100, 2)),
                   np.random.normal(-5, 1, (100, 2))))

    Step 3: Apply k-Means Clustering

    We’ll use k=3 clusters for this example.

    python
    k = 3
    kmeans = KMeans(n_clusters=k, random_state=0)
    y_kmeans = kmeans.fit_predict(X)
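
    Here fit_predict fits the model and returns one cluster label (0, 1, or 2) per sample. After fitting, the learned centroids and the final WCSS are exposed as attributes:

    python
    print(kmeans.cluster_centers_)  # learned centroids, shape (3, 2)
    print(kmeans.inertia_)          # within-cluster sum of squares (WCSS)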

    Step 4: Visualize the Clusters

    python
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis', s=50, edgecolor='k')
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('k-Means Clustering (k=3)')
    plt.legend()
    plt.show()

    4. Choosing the Number of Clusters (k)

    Selecting the right number of clusters is crucial. Two common methods are:

    Elbow Method

    The Elbow Method plots the Within-Cluster Sum of Squares (WCSS) against the number of clusters. The “elbow” point, where the rate of decrease slows sharply, suggests a good choice of k.

    python
    # Elbow Method
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, random_state=0)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    
    plt.figure(figsize=(8, 6))
    plt.plot(range(1, 11), wcss, marker='o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('WCSS')
    plt.title('Elbow Method')
    plt.show()

    Silhouette Score

    The Silhouette Score measures how similar an object is to its own cluster compared to other clusters. Higher scores indicate better-defined clusters.

    python
    from sklearn.metrics import silhouette_score
    
    # Silhouette Score
    silhouette_scores = []
    for i in range(2, 11):
        kmeans = KMeans(n_clusters=i, random_state=0)
        y_kmeans = kmeans.fit_predict(X)
        silhouette_scores.append(silhouette_score(X, y_kmeans))
    
    plt.figure(figsize=(8, 6))
    plt.plot(range(2, 11), silhouette_scores, marker='o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score')
    plt.show()

    5. Key Evaluation Metrics

    1. Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters. Lower WCSS indicates tighter clusters.

    2. Silhouette Score: Ranges from -1 to 1. Higher values indicate better-defined clusters (see the formula below).
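
    For a single sample i, let a(i) be the mean distance to the other points in its own cluster and b(i) the mean distance to the points in the nearest other cluster. The silhouette value is then:

    s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}

    The Silhouette Score reported by silhouette_score is the mean of s(i) over all samples.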


    6. Key Takeaways

    • k-Means is simple, efficient, and works well for spherical clusters.

    • The choice of k is critical and can be determined using the Elbow Method or Silhouette Score.

    • It’s sensitive to the initial placement of centroids and may struggle with non-spherical or overlapping clusters (see the note below).
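
    To mitigate this sensitivity, scikit-learn's KMeans uses the k-means++ seeding strategy by default and can run several independent initializations via n_init, keeping the run with the lowest WCSS. A minimal sketch (the parameter values here are just examples):

    python
    # k-means++ seeding (scikit-learn's default) plus 10 random restarts;
    # the run with the lowest inertia (WCSS) is kept automatically.
    kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
    kmeans.fit(X)
    print(kmeans.inertia_)  # WCSS of the best of the 10 runs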


    7. Applications of k-Means

    • Market Segmentation: Grouping customers based on behavior.

    • Image Compression: Reducing the number of colors in an image (see the sketch after this list).

    • Anomaly Detection: Identifying outliers in datasets.
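
    As an illustration of the image-compression use case, here is a minimal sketch of color quantization with k-Means. The file name is hypothetical, and the sketch assumes an RGB PNG (pixel values as floats in [0, 1]):

    python
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    # 'photo.png' is a hypothetical path - substitute any RGB PNG of your own.
    img = plt.imread('photo.png')[:, :, :3]   # keep RGB, drop any alpha channel
    pixels = img.reshape(-1, 3)               # one row per pixel

    # Cluster the pixel colors into a 16-color palette
    kmeans = KMeans(n_clusters=16, random_state=0).fit(pixels)

    # Replace every pixel with its cluster's centroid color
    compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)

    plt.imshow(compressed)
    plt.axis('off')
    plt.show()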


    8. Practice Exercise

    1. Experiment with different values of k and observe how it affects the clustering results.

    2. Apply k-Means to a real-world dataset (e.g., Mall Customer Segmentation) and evaluate the clusters.

    3. Compare k-Means with other clustering algorithms like DBSCAN or Hierarchical Clustering.


    That’s it for Day 10! Tomorrow, we’ll explore Hierarchical Clustering, another powerful unsupervised learning algorithm. Keep practicing, and feel free to ask questions in the comments! 🚀
