Hierarchical Clustering

Given $n$ points in a $d$-dimensional space, the goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram. The clusters in the hierarchy range from the fine-grained to the coarse-grained: the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster. Both of these may be considered to be trivial clusterings.

At some intermediate level, we may find meaningful clusters. If the user supplies $k$, the desired number of clusters, we can choose the level at which there are $k$ clusters. There are two main algorithmic approaches to mine hierarchical clusters: agglomerative and divisive.

Agglomerative strategies work in a bottom-up manner. That is, starting with each of the $n$ points in a separate cluster, they repeatedly merge the most similar pair of clusters until all points are members of the same cluster.
Divisive strategies do just the opposite, working in a top-down manner. Starting with all the points in the same cluster, they recursively split the clusters until all points are in separate clusters.

PRELIMINARIES
Given a dataset $D = \{x_1,\ldots,x_n\}$, where $x_i \in \mathbb{R}^d$, a clustering $C = \{C_1,\ldots,C_k\}$ is a partition of $D$, that is, each cluster is a set of points $C_i \subseteq D$, such that the clusters are pairwise disjoint, $C_i \cap C_j = \emptyset$ for all $i \ne j$, and $\bigcup_{i=1}^k C_i = D$.

A clustering $A = \{A_1,\ldots ,A_r\}$ is said to be nested in another clustering $B =\{B_1,\ldots ,B_s\}$ if and only if $r > s$, and for each cluster $A_i \in A$, there exists a cluster $B_j \in B$, such that $A_i \subseteq B_j$ .

Hierarchical clustering yields a sequence of $n$ nested partitions $C_1,\ldots,C_n$, ranging from the trivial clustering
$C_1 = \{\{x_1\},\ldots,\{x_n\}\}$,
where each point is in a separate cluster, to the other trivial clustering $C_n = \{\{x_1,\ldots,x_n\}\}$, where all points are in one cluster. In general, the clustering $C_{t-1}$ is nested in the clustering $C_t$. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with edges between a cluster $C_i \in C_{t-1}$ and a cluster $C_j \in C_t$ if $C_i$ is nested in $C_j$, that is, if $C_i \subset C_j$. In this way the dendrogram captures the entire sequence of nested clusterings.
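
As a small illustration (not part of the original text), the following snippet checks the nesting condition for a toy three-point dataset, representing each cluster as a set of point labels.

def is_nested(A, B):
    """Return True if clustering A is nested in clustering B:
    A has more clusters than B, and every cluster of A is
    contained in some cluster of B."""
    return len(A) > len(B) and all(any(a <= b for b in B) for a in A)

# Toy example with three points x1, x2, x3 (labels only).
A = [{"x1"}, {"x2"}, {"x3"}]      # finest clustering C_1
B = [{"x1", "x2"}, {"x3"}]        # an intermediate clustering
C = [{"x1", "x2", "x3"}]          # coarsest clustering C_n

print(is_nested(A, B))  # True
print(is_nested(B, C))  # True
print(is_nested(C, A))  # False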
Number of Hierarchical Clusterings
The number of different nested or hierarchical clusterings corresponds to the number of different binary rooted trees or dendrograms with $n$ leaves with distinct labels. Any tree with $t$ nodes has $t-1$ edges. Also, any rooted binary tree with $m$ leaves has $m-1$ internal nodes. Thus, a dendrogram with $m$ leaf nodes has a total of $t = m + m - 1 = 2m-1$ nodes, and consequently $t-1 = 2m-2$ edges.

To count the number of different dendrogram topologies, let us consider how we can extend a dendrogram with $m$ leaves by adding an extra leaf, to yield a dendrogram with $m+1$ leaves. Note that we can add the extra leaf by splitting (i.e., branching from) any of the $2m-2$ edges. Further, we can also add the new leaf as a child of a new root, giving $2m-2+1 = 2m-1$ new dendrograms with $m+1$ leaves. The total number of different dendrograms with $n$ leaves is thus obtained by the following product:

$\prod_{m=1}^{n-1} (2m-1) = 1 \times 3 \times 5 \times \cdots \times (2n-3) = (2n-3)!!$

The index $m$ in the equation above goes up to $n-1$ because the last term in the product denotes the number of dendrograms one obtains when we extend a dendrogram with $n-1$ leaves by adding one more leaf, to yield dendrograms with $n$ leaves. The number of possible hierarchical clusterings is thus given as $(2n-3)!!$, the double factorial of $2n-3$ (the product of all odd numbers up to $2n-3$), which grows extremely rapidly. It is obvious that a naive approach of enumerating all possible hierarchical clusterings is simply infeasible.

The figure shows the number of trees with one, two, and three leaves. The gray nodes are the virtual roots, and the black dots indicate locations where a new leaf can be added. There is only one tree possible with a single leaf, as shown in Figure (a). It can be extended in only one way to yield the unique tree with two leaves in Figure (b). However, this tree has three possible locations where the third leaf can be added; each of these cases is shown in Figure (c). We can further see that each of the trees with $m = 3$ leaves has five locations where the fourth leaf can be added, and so on, which confirms the formula $(2n-3)!!$ for the number of hierarchical clusterings.
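
To see how quickly this count grows, the following short snippet (illustrative only) evaluates the product $(2n-3)!! = 1 \times 3 \times \cdots \times (2n-3)$ for small values of $n$.

from math import prod

def num_dendrograms(n):
    """Number of distinct rooted binary dendrograms on n labeled leaves,
    i.e. the product 1 x 3 x 5 x ... x (2n - 3) = (2n - 3)!!"""
    return prod(2 * m - 1 for m in range(1, n))

for n in range(2, 11):
    print(n, num_dendrograms(n))
# Already for n = 10 there are 34,459,425 possible dendrograms.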

AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC)

In agglomerative hierarchical clustering, we begin with each of the $n$ points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster.
Formally, given a set of clusters $C = \{C_1, C_2, \ldots, C_m\}$, we find the closest pair of clusters $C_i$ and $C_j$ and merge them into a new cluster $C_{ij} = C_i \cup C_j$. Next, we update the set of clusters by removing $C_i$ and $C_j$ and adding $C_{ij}$, as follows:
$C = \big(C \setminus \{C_i, C_j\}\big) \cup \{C_{ij}\}$

We repeat the process until $C$ contains only one cluster. Because the number of clusters decreases by one in each step, this process results in a sequence of $n$ nested clusterings. If $k$ is specified, we can stop the merging process when exactly $k$ clusters remain.
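
To make the procedure concrete, here is a minimal sketch of the generic agglomerative loop in plain Python. It is illustrative only: the single-link helper anticipates the distance measure defined in the next subsection, and the example points at the end are made up.

from math import dist

def single_link(Ci, Cj, points):
    # Single-link distance: minimum Euclidean distance over all cross pairs.
    return min(dist(points[a], points[b]) for a in Ci for b in Cj)

def agglomerative_clustering(points, cluster_distance=single_link, k=1):
    # Start with each point in its own (singleton) cluster of indices.
    clusters = [[i] for i in range(len(points))]
    merges = []  # record of merges, i.e. the dendrogram structure
    while len(clusters) > k:
        # Find the closest pair of clusters (naive O(m^2) scan per step).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # C = (C - {C_i, C_j}) U {C_ij}
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Example: two well-separated groups on a line; stop at k = 2 clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
clusters, merges = agglomerative_clustering(pts, k=2)
print(clusters)  # [[0, 1], [2, 3]]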


Distance between Clusters

The main step in the algorithm is to determine the closest pair of clusters. Several distance measures, such as single link, complete link, group average, and mean distance, discussed next, can be used to compute the distance between any two clusters. The between-cluster distances are ultimately based on the distance between two points, which is typically computed using the Euclidean distance or $L_2$-norm, defined as

$\delta(x,y)=||x-y||_2=\left( \sum_{i=1}^d (x_i-y_i)^2\right )^{1/2}$

However, one may use other distance metrics, or, if available, one may use a user-specified distance matrix.
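
As a quick illustrative check of this formula with NumPy (the two points are made up for the example):

import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([4.0, 6.0, 2.0])

# delta(x, y) = sqrt((1-4)^2 + (2-6)^2 + (2-2)^2) = sqrt(9 + 16 + 0) = 5
print(np.sqrt(np.sum((x - y) ** 2)))   # 5.0
print(np.linalg.norm(x - y))           # same value via the L2-norm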

Single Link

Given two clusters $C_i$ and $C_j$, the distance between them, denoted $\delta(C_i, C_j)$, is defined as the minimum distance between a point in $C_i$ and a point in $C_j$:
$\delta(C_i, C_j) = \min\{\delta(x,y) \mid x \in C_i, y \in C_j\}$

The name single link comes from the observation that if we choose the minimum distance between points in the two clusters and connect those points, then (typically) only a single link would exist between those clusters because all other pairs of points would be farther away.

Complete Link
The distance between two clusters is defined as the maximum distance between a point in $C_i$ and a point in $C_j$ :

$\delta(C_i, C_j) = \max\{\delta(x,y) \mid x \in C_i, y \in C_j\}$

The name complete link conveys the fact that if we connect all pairs of points from the two clusters with distance at most $\delta(C_i ,C_j )$, then all possible pairs would be connected, that is, we get a complete linkage.

Group Average
The distance between two clusters is defined as the average pairwise distance between points in $C_i$ and $C_j$ :

$\delta(C_i, C_j) = \frac{\sum_{x \in C_i} \sum_{y \in C_j} \delta(x,y)}{n_i \cdot n_j}$

where $n_i = |C_i |$ denotes the number of points in cluster $C_i$ .

Mean Distance
The distance between two clusters is defined as the distance between the means or centroids of the two clusters:

$\delta(C_i,C_j)=\delta(\mu_i,\mu_j)$
where $\mu_i = \frac{1}{n_i} \sum_{x \in C_i} x$ is the mean of cluster $C_i$, and similarly for $\mu_j$.
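
The four between-cluster distances above are easy to compute directly. The following sketch (illustrative; the two small clusters are made-up 2-dimensional points, not from the text) evaluates each measure with NumPy.

import numpy as np

def point_dist(x, y):
    # Euclidean distance (L2-norm) between two points
    return np.linalg.norm(x - y)

def single_link(Ci, Cj):
    return min(point_dist(x, y) for x in Ci for y in Cj)

def complete_link(Ci, Cj):
    return max(point_dist(x, y) for x in Ci for y in Cj)

def group_average(Ci, Cj):
    return sum(point_dist(x, y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

def mean_dist(Ci, Cj):
    return point_dist(Ci.mean(axis=0), Cj.mean(axis=0))

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 0.0]])

print(single_link(Ci, Cj))    # 3.0
print(complete_link(Ci, Cj))  # 5.0
print(group_average(Ci, Cj))  # 4.0
print(mean_dist(Ci, Cj))      # 4.0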
Example (Single Link)
Consider the single link clustering shown in the figure on a dataset of five points, whose pairwise distances are also shown on the bottom left. Initially, all points are in their own cluster. The closest pairs of points are (A,B) and (C,D), both with $\delta = 1$. We choose to first merge A and B, and derive a new distance matrix for the merged cluster. Essentially, we have to compute the distances of the new cluster AB to all other clusters. For example, $\delta(AB,E) = 3$ because $\delta(AB,E) = \min\{\delta(A,E),\delta(B,E)\} = \min\{4,3\} = 3$. In the next step we merge C and D because they are the closest clusters, and we obtain a new distance matrix for the resulting set of clusters. After this, AB and CD are merged, and finally, E is merged with ABCD. In the distance matrices, we have shown (circled) the minimum distance used at each iteration that results in a merging of the two closest pairs of clusters.
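
As a quick check of the merge-step computation (using only the distances quoted in the example), the single-link distance from a merged cluster to any other cluster is just the minimum of the individual distances:

# Distances quoted in the example: delta(A,E) = 4, delta(B,E) = 3.
d_AE, d_BE = 4, 3

# After merging A and B, the single-link distance from AB to E is the minimum.
d_ABE = min(d_AE, d_BE)
print(d_ABE)  # 3, matching delta(AB, E) in the example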


Computational Complexity
In agglomerative clustering, we need to compute the distance of each cluster to all other clusters, and at each step the number of clusters decreases by 1. Initially it takes $O(n^2)$ time to create the pairwise distance matrix, unless it is specified as an input to the algorithm.
At each merge step, the distances from the merged cluster to the other clusters have to be recomputed, whereas the distances between the other clusters remain the same. This means that in step $t$, we compute $O(n-t)$ distances. The other main operation is to find the closest pair in the distance matrix. For this we can keep the $n^2$ distances in a heap data structure, which allows us to find the minimum distance in $O(1)$ time; creating the heap takes $O(n^2)$ time. Deleting/updating distances from the heap takes $O(\log n)$ time for each operation, for a total time across all merge steps of $O(n^2 \log n)$. Thus, the computational complexity of hierarchical clustering is $O(n^2 \log n)$.
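
One common way to realize this heap-based bookkeeping in Python is a min-heap over (distance, i, j) entries with lazy deletion of stale pairs. The following is a minimal sketch of the idea, with a made-up distance table for four clusters.

import heapq

def closest_active_pair(heap, active):
    """Pop entries until one refers to two still-active clusters.
    Stale entries (involving already merged clusters) are discarded lazily."""
    while heap:
        d, i, j = heapq.heappop(heap)
        if i in active and j in active:
            return d, i, j
    return None

# Hypothetical pairwise distances between 4 clusters (for illustration only).
dists = {(0, 1): 2.0, (0, 2): 5.0, (0, 3): 6.0,
         (1, 2): 4.0, (1, 3): 7.0, (2, 3): 1.5}
heap = [(d, i, j) for (i, j), d in dists.items()]
heapq.heapify(heap)          # building the heap over all O(n^2) pairs
active = {0, 1, 2, 3}

print(closest_active_pair(heap, active))  # (1.5, 2, 3): merge clusters 2 and 3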
 
Example Python Code

We will use agglomerative clustering, a type of hierarchical clustering that follows a bottom-up approach. We begin by treating each data point as its own cluster. Then, we join the clusters that have the shortest distance between them to create larger clusters. This step is repeated until one large cluster is formed containing all of the data points.

Hierarchical clustering requires us to decide on both a distance metric and a linkage method. We will use the Euclidean distance and the Ward linkage method, which merges the pair of clusters that leads to the smallest increase in total within-cluster variance.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Two-dimensional toy dataset: each (x[i], y[i]) pair is one point.
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

data = list(zip(x, y))

# Compute the full merge hierarchy (the linkage matrix) using Ward linkage
# with Euclidean distances, then draw the corresponding dendrogram.
linkage_data = linkage(data, method='ward', metric='euclidean')
dendrogram(linkage_data)

plt.show()

 

Here we do the same thing with Python's scikit-learn library, and then visualize the resulting cluster assignments on a 2-dimensional scatter plot:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Same toy dataset as above.
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

data = list(zip(x, y))

# Ward linkage with Euclidean distances, stopping when k = 2 clusters remain.
# Note: scikit-learn versions before 1.2 call the 'metric' parameter 'affinity'.
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
labels = hierarchical_cluster.fit_predict(data)

# Color each point by its assigned cluster label.
plt.scatter(x, y, c=labels)
plt.show()

 


 
