Welcome to Day 11 of the 30 Days of Data Science Series! Today, we’re diving into Hierarchical Clustering, a powerful unsupervised learning algorithm that builds a hierarchy of clusters. By the end of this lesson, you’ll understand the concept, implementation, and evaluation of hierarchical clustering in Python.
1. What is Hierarchical Clustering?
Hierarchical clustering is an unsupervised learning algorithm that builds a hierarchy of clusters, often represented as a dendrogram. It doesn’t require the number of clusters to be specified in advance, making it flexible for exploratory data analysis.
Types of Hierarchical Clustering:
Agglomerative (Bottom-Up):
Starts with each data point as a single cluster.
Iteratively merges the closest pair of clusters until all points are in one cluster or the desired number of clusters is reached (a scikit-learn sketch of this bottom-up variant follows this list).
Divisive (Top-Down):
Starts with all data points in a single cluster.
Iteratively splits the most heterogeneous cluster until each point is in its own cluster or the desired number of clusters is reached.
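In practice, the agglomerative variant is the one you will find in common libraries such as SciPy and scikit-learn. As a quick reference, here is a minimal bottom-up sketch using scikit-learn's AgglomerativeClustering; the toy data and variable names are illustrative only:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: 20 random 2-D points (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))

# Bottom-up merging with Ward linkage, stopped at 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_demo)  # one cluster label per point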
Linkage Criteria:
The choice of how to measure the distance between clusters affects the structure of the dendrogram (the sketch after this list compares the four criteria on the same data):
Single Linkage: Minimum distance between points in two clusters.
Complete Linkage: Maximum distance between points in two clusters.
Average Linkage: Average distance between points in two clusters.
Ward’s Method: Minimizes the variance within clusters.
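To get a feel for how these criteria differ, you can fit each linkage on the same data and compare cophenetic correlations, which measure how faithfully a dendrogram preserves the original pairwise distances. A minimal sketch, with toy data and variable names that are illustrative only:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 2))  # toy data, illustrative only

for method in ['single', 'complete', 'average', 'ward']:
    Z_demo = linkage(X_demo, method=method)
    # Cophenetic correlation: closer to 1 means the dendrogram
    # preserves the original pairwise distances more faithfully
    c, _ = cophenet(Z_demo, pdist(X_demo))
    print(f'{method:>8}: cophenetic correlation = {c:.3f}')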
2. When to Use Hierarchical Clustering?
When you want to explore the hierarchical structure of the data.
For datasets where the number of clusters is unknown.
Applications include gene expression analysis, document clustering, and image segmentation.
3. Implementation in Python
Let’s implement hierarchical clustering on a synthetic dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Prepare the Data
We’ll generate a synthetic dataset with three clusters.
# Generate synthetic data: three Gaussian blobs of 100 points each
np.random.seed(0)
X = np.vstack((
    np.random.normal(0, 1, (100, 2)),
    np.random.normal(5, 1, (100, 2)),
    np.random.normal(-5, 1, (100, 2)),
))
Step 3: Perform Hierarchical Clustering
We’ll use Ward’s method for linkage.
# Perform hierarchical clustering with Ward's linkage
Z = linkage(X, method='ward')
Step 4: Plot the Dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90.,
           leaf_font_size=12, show_contracted=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Step 5: Cut the Dendrogram to Form Clusters
We’ll cut the dendrogram at a distance threshold of 7.0.
# Cut the dendrogram at a distance threshold to form flat clusters
max_d = 7.0  # Distance threshold
clusters = fcluster(Z, max_d, criterion='distance')
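If you would rather specify the number of clusters directly than tune a distance threshold, fcluster also accepts the 'maxclust' criterion, which cuts the tree so that at most the requested number of flat clusters is formed:

# Alternative: request (at most) a fixed number of flat clusters
clusters_k3 = fcluster(Z, t=3, criterion='maxclust')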
Step 6: Visualize the Clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=clusters,
                palette='viridis', s=50, edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Clustering')
plt.show()
4. Choosing the Number of Clusters
The dendrogram helps visualize the hierarchy of clusters. The choice of where to cut the dendrogram (i.e., selecting a threshold distance) determines the number of clusters. Some guidelines include:
Elbow Method: Look for the point in the dendrogram where the distance between successive merges jumps sharply; cutting inside that gap yields a natural number of clusters (a sketch that automates this heuristic follows this list).
Maximum Distance: Choose a distance threshold that balances the number of clusters against their compactness; every merge that would occur above the threshold is kept split.
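One rough way to automate the largest-gap heuristic is to inspect the merge distances stored in the linkage matrix and cut inside the biggest gap. The sketch below reuses Z from Step 3; the variable names are illustrative:

import numpy as np

# Column 2 of the linkage matrix holds the distance at each merge
merge_dists = Z[:, 2]
gaps = np.diff(merge_dists)        # jumps between successive merges
i = int(np.argmax(gaps))           # index of the largest jump
n_points = Z.shape[0] + 1          # a linkage over n points has n-1 rows
n_clusters = n_points - (i + 1)    # clusters remaining after i+1 merges
cut = (merge_dists[i] + merge_dists[i + 1]) / 2  # any height inside the gap
print(f'suggested clusters: {n_clusters}, cut height ~ {cut:.2f}')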
5. Key Takeaways
Hierarchical clustering builds a hierarchy of clusters, represented as a dendrogram.
It doesn’t require the number of clusters to be specified in advance.
The choice of linkage criteria and distance threshold affects the clustering results.
6. Applications of Hierarchical Clustering
Gene Expression Data: Grouping similar genes or samples in bioinformatics.
Document Clustering: Organizing documents into a hierarchical structure.
Image Segmentation: Dividing an image into regions based on pixel similarity.
7. Practice Exercise
Experiment with different linkage methods (e.g., single, complete, average) and observe how they affect the dendrogram.
Apply hierarchical clustering to a real-world dataset (e.g., customer segmentation) and evaluate the results.
Compare hierarchical clustering with k-Means clustering on the same dataset (a starter sketch follows this list).
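As a starting point for the last exercise, here is a minimal sketch that compares the two algorithms on the synthetic data from Step 2; it assumes scikit-learn is installed and reuses X and Z from above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import fcluster

# Flat labels from the hierarchical tree vs. k-means with the same k
hier_labels = fcluster(Z, t=3, criterion='maxclust')
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette score: higher means tighter, better-separated clusters
print('hierarchical:', silhouette_score(X, hier_labels))
print('k-means:     ', silhouette_score(X, km_labels))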
That’s it for Day 11! Tomorrow, we’ll explore DBSCAN (Density-Based Spatial Clustering of Applications with Noise), another powerful clustering algorithm. Keep practicing, and feel free to ask questions in the comments! 🚀