Module: neurospin.clustering.hierarchical_clustering

Inheritance diagram for nipy.neurospin.clustering.hierarchical_clustering (diagram not reproduced here).

These routines perform hierarchical agglomerative clustering of input data. The following alternatives are provided:
- Distance-based average link
- Similarity-based average link
- Distance-based maximum link
- Ward's algorithm under graph constraints
- Ward's algorithm without graph constraints

In this latest version, the results are returned in a 'WeightedForest' structure, which gives access to the clustering hierarchy, facilitates plotting of the result, etc.

For backward compatibility, *_segment versions of the algorithms have been appended, with the old API (except the qmax parameter, which now represents the number of desired clusters).

Author: Bertrand Thirion, Pamela Guevara, 2006-2009
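As a quick orientation before the API listing: the distance-based variants above have close analogues in SciPy's `scipy.cluster.hierarchy` (average link ~ `method='average'`, maximum link ~ `method='complete'`, Ward without graph constraints ~ `method='ward'`; the graph-constrained Ward has no direct SciPy equivalent). A minimal sketch of the overall workflow using SciPy rather than this module:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two well-separated blobs of 2-D points.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 8.0])

# Condensed pairwise Euclidean distances, as linkage() expects.
D = pdist(X)

# Distance-based average link, maximum (complete) link, and Ward.
for method in ("average", "complete", "ward"):
    Z = linkage(D, method=method)  # (n-1, 4) merge table = the dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")  # ask for 2 clusters
    print(method, np.unique(labels).size)  # 2 clusters each
```

Here `Z` plays the role of the WeightedForest dendrogram, and `fcluster` plays the role of the `*_segment` variants' qmax cut.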

Class

WeightedForest

class nipy.neurospin.clustering.hierarchical_clustering.WeightedForest(V, parents=None, height=None)

Bases: nipy.neurospin.graph.graph.Forest

This is a weighted forest structure, i.e. a tree:
- each node has one parent and children (hierarchical structure)
- some nodes can be viewed as leaves, others as roots
- the edges within a tree carry a weight: +1 from child to parent, -1 from parent to child
- additionally, each node has a value called 'height', which is especially useful for dendrograms

Fields:
- V: (int, >0) the number of vertices
- E: (int) the number of edges
- parents: array of shape (self.V), the parent array
- edges: array of shape (self.E, 2) representing pairwise neighbors
- weights: array of shape (self.E), +1/-1 for ascending/descending links
- children: list of arrays giving the children of each node
- height: array of shape (self.V)

__init__(V, parents=None, height=None)
INPUT:
- V: the number of vertices of the graph
- parents=None: array of shape (V), the parents of the graph; by default the parents are set to range(V), i.e. each node is its own parent and each node is a tree
- height=None: array of shape (V), the height of the nodes
check_compatible_height()
Check that height[parents[i]]>=height[i] for all nodes
fancy_plot(validleaves)
Same as plot, but the valid edges are enhanced
fancy_plot_(valid)
Same as plot, but the valid edges are enhanced
get_height()
get the height array
list_of_subtrees()
Returns the list of all non-trivial subtrees in the graph. Caveat: this function assumes that the vertices are sorted so that parents[i] > i for all i. Only the leaves are listed, not the subtrees themselves.
partition(threshold)
partition the tree according to a cut criterion
plot()
Plot the dendrogram associated with self; the rank of the data in the dendrogram is returned
plot2(addNodes=False, font_size=10, cl_size=None)
Plot the dendrogram associated with self
plot_height()
Plot the height of the non-leaf nodes
set_height(height=None)
set the height array
split(k)
Same as partition, but a number of components is supplied instead of a threshold
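The parent-array representation behind WeightedForest can be sketched in plain NumPy. The two helpers below are illustrative stand-ins for check_compatible_height and partition, not the nipy implementation:

```python
import numpy as np

def check_compatible_height(parents, height):
    """True iff height is non-decreasing from child to parent."""
    parents = np.asarray(parents)
    height = np.asarray(height)
    return bool(np.all(height[parents] >= height))

def partition(parents, height, threshold):
    """Cut the forest at `threshold`: climb from each node while the
    parent's height stays <= threshold; nodes that end on the same
    ancestor share a cluster label."""
    n = len(parents)
    top = np.arange(n)
    for i in range(n):
        j = i
        while parents[j] != j and height[parents[j]] <= threshold:
            j = parents[j]
        top[i] = j
    # relabel the surviving ancestors as 0..k-1
    _, labels = np.unique(top, return_inverse=True)
    return labels

# A 4-leaf dendrogram: leaves 0-3, internal nodes 4 and 5, root 6.
parents = np.array([4, 4, 5, 5, 6, 6, 6])
height = np.array([0., 0., 0., 0., 1., 2., 5.])
assert check_compatible_height(parents, height)
print(partition(parents, height, threshold=3.0)[:4])  # → [0 0 1 1]
```

Cutting below the root's height (5.0) separates the two subtrees, so the leaves fall into two clusters.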

Functions

Average_Link_Distance(D, verbose=0)

Average link clustering based on a pairwise distance matrix.

INPUT:
- D: a (n, n) distance matrix between items
- verbose=0: verbosity level

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

NOTE: this method has not been optimized.

Average_Link_Distance_segment(D, stop=-1, qmax=-1, verbose=0)

Average link clustering based on a pairwise distance matrix.

INPUT:
- D: a (n, n) distance matrix between items
- stop=-1: stopping criterion, i.e. the distance threshold at which further merges are forbidden; by default, all merges are performed
- qmax=-1: the number of desired clusters (in the limit of stop)
- verbose=0: verbosity level

OUTPUT:
- u: a labelling of the graph vertices according to the criterion
- cost: the cost of each merge step during the clustering procedure

NOTE: this method has not been optimized.

Average link clustering based on a data matrix.

INPUT:
- X: a (nbitem, dim) array from which a Euclidean distance matrix is computed
- verbose=0: verbosity level

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

NOTE: this method has not been optimized.
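To illustrate the difference between the stop (distance threshold) and qmax (cluster count) cut criteria used by the `*_segment` variants, here is a sketch with SciPy's average-link routines standing in for this module:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# A (4, 4) symmetric distance matrix between 4 items:
# items 0,1 are close, items 2,3 are close, and the pairs are far apart.
D = np.array([[0., 1., 9., 9.],
              [1., 0., 9., 9.],
              [9., 9., 0., 1.],
              [9., 9., 1., 0.]])
Z = linkage(squareform(D), method="average")  # the dendrogram, as a merge table

# "stop"-style cut: forbid merges above a distance threshold.
u_stop = fcluster(Z, t=2.0, criterion="distance")
# "qmax"-style cut: ask for a fixed number of clusters.
u_qmax = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(u_stop).size, np.unique(u_qmax).size)  # 2 2
```

Both cuts recover the two pairs here; on real data the two criteria can of course yield different labellings.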
Agglomerative function based on a (hopefully sparse) similarity graph.

INPUT:
- G: the input graph

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

CAVEAT: in this case, homogeneity is associated with high similarity (as opposed to low cost, as in most clustering procedures, e.g. distance-based procedures). The tree is therefore created with negated affinity values, in order to respect the traditional ordering of cluster potentials; individual points have the potential -np.infty. This is handled transparently in the associated segment function.
Agglomerative function based on a (hopefully sparse) similarity graph.

INPUT:
- G: the input graph
- stop=0: the stopping criterion
- qmax=1: the number of desired clusters (in the limit of the stopping criterion)

OUTPUT:
- u: a labelling of the graph vertices according to the criterion
- cost: the cost of each merge step during the clustering procedure
Maximum_Link_Distance(D, verbose=0)

Maximum link clustering based on a pairwise distance matrix.

INPUT:
- D: a (n, n) distance matrix between items
- verbose=0: verbosity level

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

NOTE: this method has not been optimized.
Maximum_Link_Distance_segment(D, stop=-1, qmax=-1, verbose=0)

Maximum link clustering based on a pairwise distance matrix.

INPUT:
- D: a (n, n) distance matrix between items
- stop=-1: stopping criterion, i.e. the distance threshold at which further merges are forbidden; by default (stop=-1), all merges are performed
- qmax=-1: the number of desired clusters (in the limit of stop)
- verbose=0: verbosity level

OUTPUT:
- u: a labelling of the graph vertices according to the criterion
- cost: the cost of each merge step during the clustering procedure

NOTE: this method has not been optimized.
Maximum link clustering based on a data matrix.

INPUT:
- X: a (nbitem, dim) array from which a Euclidean distance matrix is computed
- verbose=0: verbosity level

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

NOTE: this method has not been optimized.
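The unoptimized merge loop these maximum-link routines describe can be sketched directly on a distance matrix; this is an illustrative re-implementation of the idea (complete linkage), not the nipy code:

```python
import numpy as np

def maximum_link_merges(D):
    """Naive maximum-link (complete-link) agglomeration on a distance
    matrix: repeatedly merge the two closest clusters and record the
    merge cost."""
    D = np.array(D, dtype=float)
    n = D.shape[0]
    np.fill_diagonal(D, np.inf)
    active = list(range(n))
    costs = []
    while len(active) > 1:
        sub = D[np.ix_(active, active)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[i], active[j]
        costs.append(float(D[a, b]))
        # maximum link: distance to the merged cluster is the max of the two
        D[a, :] = np.maximum(D[a, :], D[b, :])
        D[:, a] = D[a, :]
        D[a, a] = np.inf
        active.remove(b)
    return costs

D = [[0., 1., 9., 9.],
     [1., 0., 9., 8.],
     [9., 9., 0., 1.],
     [9., 8., 1., 0.]]
print(maximum_link_merges(D))  # → [1.0, 1.0, 9.0]
```

The returned costs are the n-1 merge heights, i.e. the third column of a standard linkage table; the quadratic rescan of `D` at each step is what "has not been optimized" refers to.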
nipy.neurospin.clustering.hierarchical_clustering.Ward(G, feature, verbose=0)
Agglomerative function based on a topology-defining graph and a feature matrix.

INPUT:
- G: the input graph (essentially a topological graph)
- feature: a (G.V, dim_feature) array that yields vectorial information related to the graph vertices

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram

NOTE: when G has more than one connected component, t is no longer a tree; this case is now handled cleanly.
nipy.neurospin.clustering.hierarchical_clustering.Ward_quick(G, feature, verbose=0)
Agglomerative function based on a topology-defining graph and a feature matrix.

INPUT:
- G: the input graph (essentially a topological graph)
- feature: a (G.V, dim_feature) array that yields vectorial information related to the graph vertices

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

NOTE:
- hopefully a quicker version
- a Euclidean distance is used in the feature space

CAVEAT: only approximate.
nipy.neurospin.clustering.hierarchical_clustering.Ward_quick_segment(G, feature, stop=-1, qmax=1, verbose=0)
Agglomerative function based on a topology-defining graph and a feature matrix.

INPUT:
- G: the input graph (essentially a topological graph)
- feature: a (G.V, dim_feature) array that yields vectorial information related to the graph vertices
- stop=-1: the stopping criterion; if stop == -1, no stopping criterion is used
- qmax=1: the maximum number of desired clusters (in the limit of the stopping criterion)

OUTPUT:
- u: a labelling of the graph vertices according to the criterion
- cost: the cost of each merge step during the clustering procedure

NOTE:
- hopefully a quicker version
- a Euclidean distance is used in the feature space

CAVEAT: only approximate.
nipy.neurospin.clustering.hierarchical_clustering.Ward_segment(G, feature, stop=-1, qmax=1, verbose=0)
Agglomerative function based on a topology-defining graph and a feature matrix.

INPUT:
- G: the input graph (essentially a topological graph)
- feature: a (G.V, dim_feature) array that yields vectorial information related to the graph vertices
- stop=-1: the stopping criterion; if stop == -1, no stopping criterion is used
- qmax=1: the maximum number of desired clusters (in the limit of the stopping criterion)

OUTPUT:
- u: a labelling of the graph vertices according to the criterion
- cost: the cost of each merge step during the clustering procedure

NOTE: a Euclidean distance is used in the feature space.

CAVEAT: when the number of connected components of G (nbcc) is greater than qmax, u contains nbcc values, not qmax!
nipy.neurospin.clustering.hierarchical_clustering.Ward_simple(feature, verbose=0, quick=1)

Ward clustering based on a Feature matrix

t = Ward_simple(feature, verbose=0)

INPUT:
- feature: a (n, p) feature matrix representing n p-dimensional items to be clustered
- verbose=0: verbosity level

OUTPUT:
- t: a WeightedForest structure that represents the dendrogram of the data

NOTE: this method uses the optimized C routine if "quick" is true.
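For Ward clustering on a plain feature matrix (the Ward_simple case, without graph constraints), SciPy's `linkage` offers an analogous computation; a sketch, not the nipy routine:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A (n, p) feature matrix: two tight groups in 3-D feature space.
rng = np.random.RandomState(1)
feature = np.vstack([rng.randn(15, 3), rng.randn(15, 3) + 10.0])

Z = linkage(feature, method="ward")  # Ward on Euclidean feature distances
u = fcluster(Z, t=2, criterion="maxclust")

# Ward merge heights (column 2 of Z) are non-decreasing: no inversions,
# so the dendrogram heights are compatible in the WeightedForest sense.
assert np.all(np.diff(Z[:, 2]) >= 0)
print(np.unique(u).size)  # 2
```

Note that the graph-constrained variants (Ward, Ward_segment) restrict merges to graph neighbors, which `linkage` does not do.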

nipy.neurospin.clustering.hierarchical_clustering.fusion(K, pop, i, j, k)

Modifies the graph K to merge nodes i and j into node k. The similarity values are weighted-averaged, where pop[i] and pop[j] yield the relative weights. This is used in average_link_slow (deprecated).
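The weighted-average merge this docstring describes can be sketched in NumPy; the function below is an illustrative stand-in for the nipy fusion, assuming K is a dense similarity matrix and pop holds cluster populations:

```python
import numpy as np

def fusion(K, pop, i, j, k):
    """Merge similarity rows/columns i and j of K into slot k,
    weighting by the cluster populations pop[i] and pop[j]
    (a sketch of the described behaviour, not the nipy code)."""
    w = pop[i] + pop[j]
    K[k, :] = (pop[i] * K[i, :] + pop[j] * K[j, :]) / w
    K[:, k] = K[k, :]
    pop[k] = w
    return K, pop

K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
pop = np.array([1.0, 3.0, 1.0])
K, pop = fusion(K, pop, 0, 1, 0)  # merge node 1 into node 0
print(K[0, 2], pop[0])  # (1*0.1 + 3*0.3)/4 = 0.25, population 4.0
```

Weighting by population keeps the merged similarity equal to the average pairwise similarity between the combined cluster and each other node, which is exactly the average-link update.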