 
 
 
 
 
 
 
  
Clustering can be posed as a graph partitioning problem.  The objects
are viewed as the set of vertices 
 .  Two documents
.  Two documents
 and
 and 
 (or vertices
 (or vertices  and
 and  ) are
connected with an undirected edge of positive weight
) are
connected with an undirected edge of positive weight
 , or
, or 
 .  The cardinality of the set of edges
.  The cardinality of the set of edges 
 equals the number of non-zero similarities between all pairs of
samples.  A set of edges whose removal partitions a graph
equals the number of non-zero similarities between all pairs of
samples.  A set of edges whose removal partitions a graph
 into
 into  pair-wise disjoint
sub-graphs
 pair-wise disjoint
sub-graphs 
 , is called an edge
separator.  The objective in graph partitioning is to find such a
separator with a minimum sum of edge weights.  While striving for the
minimum cut objective, the number of objects in each cluster has to be
kept approximately equal.  We produce balanced (equal sized) clusters
from the similarity matrix using the multi-level
graph partitioner METIS [KK98a].
The most expensive step in this
, is called an edge
separator.  The objective in graph partitioning is to find such a
separator with a minimum sum of edge weights.  While striving for the
minimum cut objective, the number of objects in each cluster has to be
kept approximately equal.  We produce balanced (equal sized) clusters
from the similarity matrix using the multi-level
graph partitioner METIS [KK98a].
The most expensive step in this 
 technique is
the computation of the
 technique is
the computation of the 
 similarity matrix. In document
clustering, sparsity can be induced by looking only at the
 similarity matrix. In document
clustering, sparsity can be induced by looking only at the  strongest edges or at the subgraph induced by pruning all edges except
the
strongest edges or at the subgraph induced by pruning all edges except
the  nearest-neighbors for each vertex.  Sparsity makes this
approach feasible for large data-sets. Sparsity is induced by
particular similarities definitions based e.g., on the cosine of
document vectors.
 nearest-neighbors for each vertex.  Sparsity makes this
approach feasible for large data-sets. Sparsity is induced by
particular similarities definitions based e.g., on the cosine of
document vectors.
 
 
 
 
 
 
