Welcome to Day 11 of the 30 Days of Data Science Series! Today, we’re diving into Hierarchical Clustering, a powerful unsupervised learning algorithm that builds a hierarchy of clusters. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of hierarchical clustering in Python.
1. What is Hierarchical Clustering?
Hierarchical clustering is an unsupervised learning algorithm that builds a hierarchy of clusters, often represented as a dendrogram. It doesn’t require the number of clusters to be specified in advance, making it flexible for exploratory data analysis.
Types of Hierarchical Clustering:
Agglomerative (Bottom-Up):
Starts with each data point as a single cluster.
Iteratively merges the closest pair of clusters until all points are in one cluster or the desired number of clusters is reached (a scikit-learn sketch of this bottom-up variant follows this list).
Divisive (Top-Down):
Starts with all data points in a single cluster.
Iteratively splits the most heterogeneous cluster until each point is in its own cluster or the desired number of clusters is reached.
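In practice, the agglomerative variant is the one you will find in common libraries such as SciPy and scikit-learn. As a quick reference, here is a minimal bottom-up sketch using scikit-learn's AgglomerativeClustering; the toy data and variable names are illustrative only:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: 20 random 2-D points (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))

# Bottom-up merging with Ward linkage, stopped at 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_demo)  # one cluster label per point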
Linkage Criteria:
The choice of how to measure the distance between clusters affects the structure of the dendrogram (the sketch after this list compares the four criteria on the same data):
Single Linkage: Minimum distance between points in two clusters.
Complete Linkage: Maximum distance between points in two clusters.
Average Linkage: Average distance between points in two clusters.
Ward’s Method: Minimizes the variance within clusters.
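To get a feel for how these criteria differ, you can fit each linkage on the same data and compare cophenetic correlations, which measure how faithfully a dendrogram preserves the original pairwise distances. A minimal sketch, with toy data and variable names that are illustrative only:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 2))  # toy data, illustrative only

for method in ['single', 'complete', 'average', 'ward']:
    Z_demo = linkage(X_demo, method=method)
    # Cophenetic correlation: closer to 1 means the dendrogram
    # preserves the original pairwise distances more faithfully
    c, _ = cophenet(Z_demo, pdist(X_demo))
    print(f'{method:>8}: cophenetic correlation = {c:.3f}')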
2. When to Use Hierarchical Clustering?
When you want to explore the hierarchical structure of the data.
For datasets where the number of clusters is unknown.
Applications include gene expression analysis, document clustering, and image segmentation.
3. Implementation in Python
Let’s implement hierarchical clustering on a synthetic dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Prepare the Data
We’ll generate a synthetic dataset with three clusters.
# Generate synthetic data: three Gaussian blobs of 100 points each
np.random.seed(0)
X = np.vstack((
    np.random.normal(0, 1, (100, 2)),
    np.random.normal(5, 1, (100, 2)),
    np.random.normal(-5, 1, (100, 2)),
))
Step 3: Perform Hierarchical Clustering
We’ll use Ward’s method for linkage.
# Perform hierarchical clustering with Ward's linkage
Z = linkage(X, method='ward')
Step 4: Plot the Dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90.,
           leaf_font_size=12, show_contracted=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Step 5: Cut the Dendrogram to Form Clusters
We’ll cut the dendrogram at a distance threshold of 7.0.
# Cut the dendrogram at a distance threshold to form flat clusters
max_d = 7.0  # Distance threshold
clusters = fcluster(Z, max_d, criterion='distance')
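If you would rather specify the number of clusters directly than tune a distance threshold, fcluster also accepts the 'maxclust' criterion, which cuts the tree so that at most the requested number of flat clusters is formed:

# Alternative: request (at most) a fixed number of flat clusters
clusters_k3 = fcluster(Z, t=3, criterion='maxclust')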
Step 6: Visualize the Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=clusters,
                palette='viridis', s=50, edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Clustering')
plt.show()
4. Choosing the Number of Clusters
The dendrogram helps visualize the hierarchy of clusters. The choice of where to cut the dendrogram (i.e., selecting a threshold distance) determines the number of clusters. Some guidelines include:
Elbow Method: Look for the point in the dendrogram where the distance between successive merges jumps sharply; cutting inside that gap yields a natural number of clusters (a sketch that automates this heuristic follows this list).
Maximum Distance: Choose a distance threshold that balances the number of clusters against their compactness; every merge that would occur above the threshold is kept split.
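One rough way to automate the largest-gap heuristic is to inspect the merge distances stored in the linkage matrix and cut inside the biggest gap. The sketch below reuses Z from Step 3; the variable names are illustrative:

import numpy as np

# Column 2 of the linkage matrix holds the distance at each merge
merge_dists = Z[:, 2]
gaps = np.diff(merge_dists)        # jumps between successive merges
i = int(np.argmax(gaps))           # index of the largest jump
n_points = Z.shape[0] + 1          # a linkage over n points has n-1 rows
n_clusters = n_points - (i + 1)    # clusters remaining after i+1 merges
cut = (merge_dists[i] + merge_dists[i + 1]) / 2  # any height inside the gap
print(f'suggested clusters: {n_clusters}, cut height ~ {cut:.2f}')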
5. Key Takeaways
Hierarchical clustering builds a hierarchy of clusters, represented as a dendrogram.
It doesn’t require the number of clusters to be specified in advance.
The choice of linkage criteria and distance threshold affects the clustering results.
6. Applications of Hierarchical Clustering
Gene Expression Data: Grouping similar genes or samples in bioinformatics.
Document Clustering: Organizing documents into a hierarchical structure.
Image Segmentation: Dividing an image into regions based on pixel similarity.
7. Practice Exercise
Experiment with different linkage methods (e.g., single, complete, average) and observe how they affect the dendrogram.
Apply hierarchical clustering to a real-world dataset (e.g., customer segmentation) and evaluate the results.
Compare hierarchical clustering with k-Means clustering on the same dataset (a starter sketch follows this list).
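As a starting point for the last exercise, here is a minimal sketch that compares the two algorithms on the synthetic data from Step 2; it assumes scikit-learn is installed and reuses X and Z from above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import fcluster

# Flat labels from the hierarchical tree vs. k-means with the same k
hier_labels = fcluster(Z, t=3, criterion='maxclust')
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette score: higher means tighter, better-separated clusters
print('hierarchical:', silhouette_score(X, hier_labels))
print('k-means:     ', silhouette_score(X, km_labels))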
That’s it for Day 11! Tomorrow, we’ll explore DBSCAN (Density-Based Spatial Clustering of Applications with Noise), another powerful clustering algorithm. Keep practicing, and feel free to ask questions in the comments! 🚀