
    Welcome to Day 11 of the 30 Days of Data Science Series! Today, we’re diving into Hierarchical Clustering, a powerful unsupervised learning algorithm that builds a hierarchy of clusters. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of hierarchical clustering in Python.


    1. What is Hierarchical Clustering?

    Hierarchical clustering is an unsupervised learning algorithm that builds a hierarchy of clusters, often represented as a dendrogram. It doesn’t require the number of clusters to be specified in advance, making it flexible for exploratory data analysis.

    Types of Hierarchical Clustering:

    1. Agglomerative (Bottom-Up):

      • Starts with each data point as a single cluster.

      • Iteratively merges the closest pair of clusters until all points form one cluster or the desired number of clusters is reached (a from-scratch sketch follows this list).

    2. Divisive (Top-Down):

      • Starts with all data points in a single cluster.

      • Iteratively splits the most heterogeneous cluster until each point is in its own cluster or the desired number of clusters is reached.
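
    To make the agglomerative procedure concrete, here is a minimal from-scratch sketch using single linkage. The function name agglomerate and the O(n^3) double loop are illustrative only; in practice you would use the SciPy implementation shown later in this lesson.

    import numpy as np

    def agglomerate(X, n_clusters):
        # Start with every point in its own cluster (bottom-up).
        clusters = [[i] for i in range(len(X))]
        while len(clusters) > n_clusters:
            best = (0, 1, np.inf)
            # Find the pair of clusters with the smallest single-linkage
            # distance, i.e., the closest pair of points across clusters.
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(np.linalg.norm(X[i] - X[j])
                            for i in clusters[a] for j in clusters[b])
                    if d < best[2]:
                        best = (a, b, d)
            a, b, _ = best
            clusters[a].extend(clusters[b])  # merge the closest pair
            del clusters[b]
        return clusters  # lists of row indices into X

    The divisive variant runs in the opposite direction: start with one cluster holding everything and recursively split the most heterogeneous one.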

    Linkage Criteria:

    The choice of how to measure the distance between clusters affects the structure of the dendrogram; the short sketch after this list compares the options:

    • Single Linkage: Minimum distance between points in two clusters.

    • Complete Linkage: Maximum distance between points in two clusters.

    • Average Linkage: Average distance between points in two clusters.

    • Ward’s Method: Merges the pair of clusters that yields the smallest increase in total within-cluster variance.
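
    These criteria can be compared quantitatively with the cophenetic correlation coefficient, which measures how faithfully each dendrogram preserves the original pairwise distances (values closer to 1 are better). A minimal sketch on toy data; the 50-point X_demo array is an assumption for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(50, 2))  # small toy dataset, illustrative only

    for method in ['single', 'complete', 'average', 'ward']:
        Z_demo = linkage(X_demo, method=method)
        c, _ = cophenet(Z_demo, pdist(X_demo))  # correlation with raw distances
        print(f'{method:>8}: cophenetic correlation = {c:.3f}')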


    2. When to Use Hierarchical Clustering?

    • When you want to explore the hierarchical structure of the data.

    • For datasets where the number of clusters is unknown.

    • Applications include gene expression analysis, document clustering, and image segmentation.


    3. Implementation in Python

    Let’s implement hierarchical clustering on a synthetic dataset.

    Step 1: Import Libraries

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
    import matplotlib.pyplot as plt
    import seaborn as sns

    Step 2: Prepare the Data

    We’ll generate a synthetic dataset with three clusters.

    # Generate synthetic data
    np.random.seed(0)
    X = np.vstack((np.random.normal(0, 1, (100, 2)),
                   np.random.normal(5, 1, (100, 2)),
                   np.random.normal(-5, 1, (100, 2))))

    Step 3: Perform Hierarchical Clustering

    We’ll use Ward’s method for linkage.

    # Perform hierarchical clustering
    Z = linkage(X, method='ward')
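
    The linkage matrix Z encodes the full merge history: each of its n - 1 rows records one merge as [index_a, index_b, distance, sample_count], where an index of n or more refers to a cluster created by an earlier merge. A quick, optional way to inspect it (the expected shapes assume our 300-point X):

    # Peek at the merge history
    print(Z.shape)   # (299, 4): one row per merge
    print(Z[:3])     # the three lowest-distance merges happen first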

    Step 4: Plot the Dendrogram

    plt.figure(figsize=(10, 7))
    dendrogram(Z,
               truncate_mode='level',  # show no more than p levels of the tree
               p=5,
               leaf_rotation=90.,      # rotate labels so they stay readable
               leaf_font_size=12.,
               show_contracted=True)   # mark where branches were collapsed
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')
    plt.show()

    Step 5: Cut the Dendrogram to Form Clusters

    We’ll cut the dendrogram at a distance threshold of 7.0.

    # Cut the dendrogram to form clusters
    max_d = 7.0  # Distance threshold
    clusters = fcluster(Z, max_d, criterion='distance')
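
    If you already know how many clusters you want, fcluster can produce them directly instead of cutting at a distance, using SciPy's 'maxclust' criterion:

    # Alternative: request exactly 3 clusters rather than thresholding
    clusters_k = fcluster(Z, t=3, criterion='maxclust')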

    Step 6: Visualize the Clusters

    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=clusters, palette='viridis', s=50, edgecolor='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Hierarchical Clustering')
    plt.show()
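
    To put a number on the result, a common choice is the silhouette score, which ranges from -1 to 1 with higher values meaning tighter, better-separated clusters. This sketch assumes scikit-learn is installed; it is not imported elsewhere in this lesson:

    from sklearn.metrics import silhouette_score

    print(f'Silhouette score: {silhouette_score(X, clusters):.3f}')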

    4. Choosing the Number of Clusters

    The dendrogram helps visualize the hierarchy of clusters. The choice of where to cut the dendrogram (i.e., selecting a threshold distance) determines the number of clusters. Some guidelines include:

    • Largest Gap (“Elbow”): Cut where the vertical distance between successive merges jumps the most; a large gap means the clusters being merged were far apart.

    • Maximum Distance: Choose a distance threshold that balances the number of clusters against their compactness (the sketch below automates the largest-gap heuristic).
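
    Because the merge distances are stored in the third column of the linkage matrix, the largest-gap heuristic can be computed programmatically. A minimal sketch; the variable names are illustrative:

    # Merge distances grow as we move up the dendrogram
    merge_dists = Z[:, 2]
    gaps = np.diff(merge_dists)
    idx = int(np.argmax(gaps))                 # merge just before the biggest jump
    threshold = (merge_dists[idx] + merge_dists[idx + 1]) / 2
    n_clusters = len(X) - (idx + 1)            # each merge reduces the count by one
    print(f'Suggested cut: {threshold:.2f} -> {n_clusters} clusters')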


    5. Key Takeaways

    • Hierarchical clustering builds a hierarchy of clusters, represented as a dendrogram.

    • It doesn’t require the number of clusters to be specified in advance.

    • The choice of linkage criteria and distance threshold affects the clustering results.


    6. Applications of Hierarchical Clustering

    • Gene Expression Data: Grouping similar genes or samples in bioinformatics.

    • Document Clustering: Organizing documents into a hierarchical structure.

    • Image Segmentation: Dividing an image into regions based on pixel similarity.


    7. Practice Exercise

    1. Experiment with different linkage methods (e.g., single, complete, average) and observe how they affect the dendrogram.

    2. Apply hierarchical clustering to a real-world dataset (e.g., customer segmentation) and evaluate the results.

    3. Compare hierarchical clustering with k-Means clustering on the same dataset (a starting sketch follows below).
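
    For exercise 3, scikit-learn exposes both algorithms behind the same fit_predict interface, which makes a side-by-side comparison easy. A starting sketch, assuming the X array from this lesson is still in scope and using silhouette as one possible yardstick:

    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.metrics import silhouette_score

    agg = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
    km = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

    for name, labels in [('Agglomerative', agg), ('k-Means', km)]:
        print(f'{name}: silhouette = {silhouette_score(X, labels):.3f}')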


    That’s it for Day 11! Tomorrow, we’ll explore DBSCAN (Density-Based Spatial Clustering of Applications with Noise), another powerful clustering algorithm. Keep practicing, and feel free to ask questions in the comments! 🚀
