Hierarchical Clustering
Hierarchical Clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. This README provides an overview of hierarchical clustering, covering its fundamental concepts, the two main types, the steps of the algorithm, and how to implement it in Python.
Introduction
Hierarchical Clustering is an unsupervised learning method used to group similar objects into clusters. Unlike partitional techniques such as k-means, hierarchical clustering does not require the number of clusters to be specified beforehand. It produces a tree-like structure called a dendrogram, which displays the arrangement of the clusters and their sub-clusters.
Concepts
Dendrogram
A dendrogram is a tree-like diagram that records the sequences of merges or splits. It is a useful tool for visualizing the process of hierarchical clustering.
Distance Measure
Distance measures are used to quantify the similarity or dissimilarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine similarity.
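For example, SciPy exposes all three of these measures directly (Manhattan distance appears under the name cityblock, and cosine returns the cosine distance, i.e. one minus the cosine similarity). A quick sketch on two toy vectors:
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
print(euclidean(a, b))  # sqrt(3^2 + 2^2 + 0^2) = sqrt(13), about 3.606
print(cityblock(a, b))  # Manhattan: |1-4| + |2-0| + |3-3| = 5
print(cosine(a, b))     # cosine distance = 1 - cosine similarity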
Linkage Criteria
Linkage criteria determine how the distance between clusters is calculated. Different linkage criteria include single linkage, complete linkage, average linkage, and Ward's linkage.
Types of Hierarchical Clustering
- Agglomerative Clustering (Bottom-Up Approach):
  - Starts with each data point as a separate cluster.
  - Repeatedly merges the closest pairs of clusters until only one cluster remains or a stopping criterion is met.
- Divisive Clustering (Top-Down Approach):
  - Starts with all data points in a single cluster.
  - Repeatedly splits clusters into smaller clusters until each data point is its own cluster or a stopping criterion is met (see the sketch after this list).
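Agglomerative clustering is implemented at length below. Divisive clustering has no dedicated estimator in scikit-learn; a commonly used top-down stand-in is bisecting k-means, which starts from one cluster and repeatedly splits it. A minimal sketch, assuming scikit-learn >= 1.1 (the data here is synthetic and purely illustrative):
import numpy as np
from sklearn.cluster import BisectingKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))  # toy data

# Top-down: start with one cluster and split repeatedly until 4 remain
labels = BisectingKMeans(n_clusters=4, random_state=0).fit_predict(X)
print(labels)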
Steps in Hierarchical Clustering
1. Calculate Distance Matrix: Compute the distance between each pair of data points.
2. Create Clusters: Treat each data point as a single cluster.
3. Merge Closest Clusters: Find the two clusters that are closest to each other and merge them into a single cluster.
4. Update Distance Matrix: Update the distance matrix to reflect the distance between the new cluster and the remaining clusters.
5. Repeat: Repeat steps 3 and 4 until all data points are merged into a single cluster or the desired number of clusters is achieved (a minimal code sketch of this loop follows below).
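To make the loop concrete, here is a minimal, unoptimized sketch of these five steps in plain NumPy using single linkage; the function name and toy data are illustrative, and real workloads should use SciPy's optimized linkage routine shown later instead.
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    # Step 1: distance matrix between every pair of points
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # Step 2: every point starts as its own cluster (lists of point indices)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Step 3: find the closest pair of clusters; with single linkage this
        # is the minimum point-to-point distance between the two clusters
        best_pair, best_dist = (0, 1), np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])].min()
                if d < best_dist:
                    best_pair, best_dist = (i, j), d
        # Steps 4-5: merge the pair and repeat; the point-level matrix D
        # already contains everything needed to re-evaluate the linkage
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(agglomerative_single_linkage(X, 2))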
Linkage Criteria
- Single Linkage (Minimum Linkage): The distance between two clusters is defined as the minimum distance between any single data point in the first cluster and any single data point in the second cluster.
- Complete Linkage (Maximum Linkage): The distance between two clusters is defined as the maximum distance between any single data point in the first cluster and any single data point in the second cluster.
- Average Linkage: The distance between two clusters is defined as the average distance between all pairs of data points, one from each cluster.
- Ward's Linkage: The distance between two clusters is defined as the increase in the total within-cluster sum of squared deviations from the cluster means that results from merging the two clusters (the sketch below compares all four criteria on toy data).
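To see how the criteria differ numerically, the sketch below builds a SciPy linkage for each method on toy data (two well-separated groups of three points) and prints the height of the final merge, which is the distance between the two groups as each criterion defines it; note that Ward's height reflects the increase in within-cluster variance rather than a plain geometric distance.
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two well-separated groups of three points each (toy data)
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]], dtype=float)

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # Each row of Z is (cluster_i, cluster_j, merge_height, new_size);
    # the last row records the final merge between the two remaining clusters
    print(f"{method:>8}: distance between the two groups = {Z[-1, 2]:.3f}")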
Implementation
Using Scikit-learn
Scikit-learn is a popular machine learning library in Python that provides tools for hierarchical clustering; the example below pairs it with SciPy, which supplies the linkage and dendrogram routines.
Code Example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
# Load dataset (replace the path with your own file; this example assumes
# all columns are numeric)
data = pd.read_csv('path/to/your/dataset.csv')
# Standardize the features so that no single column dominates the distances
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Perform hierarchical clustering
Z = linkage(data_scaled, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
# Perform Agglomerative Clustering (Ward linkage always uses Euclidean
# distance, so no metric needs to be passed; the former 'affinity'
# argument is deprecated in recent scikit-learn releases)
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg_clustering.fit_predict(data_scaled)
# Add cluster labels to the original data
data['Cluster'] = labels
print(data.head())
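As a follow-up, the same flat clustering can be read directly off the SciPy linkage matrix Z built above, without re-fitting a scikit-learn estimator; note that fcluster numbers its labels from 1 rather than 0, and the partition should normally match the Ward-linkage result above on the same scaled data.
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram stored in Z into 3 flat clusters
scipy_labels = fcluster(Z, t=3, criterion='maxclust')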
Evaluation Metrics
- Silhouette Score: Measures how similar a data point is to its own cluster compared to other clusters.
- Cophenetic Correlation Coefficient: Measures how faithfully a dendrogram preserves the pairwise distances between the original data points.
- Dunn Index: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance (the first two metrics are computed in the sketch below).
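The first two metrics are available out of the box in scikit-learn and SciPy; the Dunn index is not, so it is omitted from the sketch below. The data and parameters here are synthetic and purely illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two synthetic 2-D blobs of 25 points each
X = np.vstack([rng.normal(0, 0.5, size=(25, 2)),
               rng.normal(4, 0.5, size=(25, 2))])

# Silhouette score of a 2-cluster agglomerative solution (range [-1, 1])
labels = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X)
print('Silhouette score:', silhouette_score(X, labels))

# Cophenetic correlation between the dendrogram and the original distances
Z = linkage(X, method='ward')
c, _ = cophenet(Z, pdist(X))
print('Cophenetic correlation:', c)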
Conclusion
Hierarchical clustering is a versatile and intuitive method for clustering data. It is particularly useful when the number of clusters is not known beforehand. By understanding the different linkage criteria and evaluation metrics, one can effectively apply hierarchical clustering to various types of data.