 
 
 
 
 
 
 
  
 2340 word-by-document matrix consists of the
non-normalized occurrence frequencies of stemmed words, obtained with
Porter's suffix-stripping algorithm [Fra92]. Words that occur fewer
than 0.01 or more than 0.10 times on average are pruned because they
are insignificant (e.g., haruspex) or too generic (e.g., new),
respectively. We call the resulting data-set YAHOO (see also
appendix A.5).
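As a minimal sketch of the pruning step, assuming that "on average" means the mean count per document and that the matrix is stored with one row per word, the filter could look as follows. The function name, the helper `toy_row`, and the toy counts are hypothetical and serve only as illustration; stemming is not shown.

```python
def prune_words(counts, vocab, low=0.01, high=0.10):
    """Drop words whose mean count per document is below `low` or above `high`.

    counts: word-by-document matrix as a list of rows (one row per word).
    vocab:  the word belonging to each row of `counts`.
    """
    kept_counts, kept_vocab = [], []
    for word, row in zip(vocab, counts):
        mean_count = sum(row) / len(row)  # average occurrences per document
        if low <= mean_count <= high:
            kept_counts.append(row)
            kept_vocab.append(word)
    return kept_counts, kept_vocab


def toy_row(occurrences, num_docs=200):
    """Hypothetical helper: a word occurring once in `occurrences` documents."""
    return [1] * occurrences + [0] * (num_docs - occurrences)


# Toy data (not from YAHOO): mean counts are 0.005, 0.05 and 0.25.
vocab = ["haruspex", "health", "new"]
counts = [toy_row(1), toy_row(10), toy_row(50)]
kept_counts, kept_vocab = prune_words(counts, vocab)
print(kept_vocab)  # prints ['health']
```

In this toy example the rare word and the generic word are both removed, and only the word of intermediate frequency survives.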
Let us point out some noteworthy differences between clustering
market-baskets and documents. First, discriminating by vector length
is no longer desired: customer life-time value matters, but
document length does not. Consequently, we use cosine similarity
instead of extended Jaccard similarity. Second, in document clustering
we are less concerned about balancing, since the actions derived from
the clustering usually involve no direct monetary costs. We therefore
over-cluster first with sample-balanced OPOSSUM
and then allow user-guided merging of clusters through CLUSION.
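The effect of vector length on the two similarity measures can be illustrated with a small sketch. It uses the standard definitions, cosine similarity x·y / (‖x‖ ‖y‖) and extended Jaccard similarity x·y / (‖x‖² + ‖y‖² − x·y); the function names and the toy vectors are our own and not part of the original experiments.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Depends only on the angle between x and y, not on their lengths."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def extended_jaccard_similarity(x, y):
    """Penalizes differences in vector length as well as in direction."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# A document and a three times longer document with the same word mix.
short_doc = [1, 2, 0, 1]
long_doc = [3 * value for value in short_doc]

print(cosine_similarity(short_doc, long_doc))            # 1.0
print(extended_jaccard_similarity(short_doc, long_doc))  # 3/7, about 0.43
```

Cosine similarity treats the two documents as identical, whereas extended Jaccard similarity drops to 3/7 purely because of the length difference. This is the discrimination that is useful for market-baskets but unwanted for documents.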
The YAHOO data-set is notorious for having some diffuse groups
with overlaps among categories, a few categories with multi-modal
distributions, etc. These aspects can be easily explored by looking at
the class labels within each cluster, merging some clusters and then
again visualizing the results.
Figure 3.6 shows clusterings with three settings of the number of
clusters. In the first setting (figure 3.6(a)), most clusters are
not dense enough, even though the first two clusters already
look as if they should not have been split. After increasing the
number of clusters (figure 3.6(b)), CLUSION indicates that
the clustering now has sufficiently compact clusters. We then
successively merge pairs of highly related clusters until we obtain
our final clustering (figure 3.6(c)). The merging process is guided by
inter-cluster similarity (e.g., bright off-diagonal regions) augmented
by cluster descriptions (e.g., related frequent words). In fact, in
the graphical user interface of CLUSION, merging is as easy
as clicking on a selected off-diagonal region.
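The merging itself is user guided, but one merge step can be approximated by a simple automated stand-in: pick the pair of clusters with the highest mean inter-cluster similarity (the "brightest" off-diagonal block) and give both the same label. The sketch below is such a stand-in and not the CLUSION interface; all function names and the toy similarity values are hypothetical, and cluster descriptions are not taken into account.

```python
from itertools import combinations


def mean_block_similarity(sim, members_a, members_b):
    """Average similarity between the objects of two clusters."""
    total = sum(sim[i][j] for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))


def most_similar_pair(sim, labels):
    """Return the two cluster labels with the brightest off-diagonal block."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    return max(
        combinations(sorted(clusters), 2),
        key=lambda pair: mean_block_similarity(
            sim, clusters[pair[0]], clusters[pair[1]]
        ),
    )


def merge_clusters(labels, keep, absorb):
    """Relabel every object of cluster `absorb` as cluster `keep`."""
    return [keep if label == absorb else label for label in labels]


def toy_similarity(labels, between, within=0.9):
    """Hypothetical block-structured similarity matrix for the example."""
    size = len(labels)
    sim = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            a, b = sorted((labels[i], labels[j]))
            sim[i][j] = within if a == b else between[(a, b)]
    return sim


labels = [0, 0, 1, 1, 2, 2]
sim = toy_similarity(labels, between={(0, 1): 0.8, (0, 2): 0.1, (1, 2): 0.2})
keep, absorb = most_similar_pair(sim, labels)
merged = merge_clusters(labels, keep, absorb)
print((keep, absorb), merged)  # prints (0, 1) [0, 0, 0, 0, 2, 2]
```

Repeating this step until the desired number of clusters remains mimics the successive merging described above, with the difference that a user would also consult the cluster descriptions before accepting a merge.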
Each cluster is evaluated using the dominant category, purity, and
entropy. Let $n_\ell^{(h)}$ denote the number of objects in cluster
$\mathcal{C}_\ell$ that are classified to be in category $h$ as
given by the original Yahoo! categorization, and let
$n_\ell = \sum_h n_\ell^{(h)}$ be the cluster size. Cluster
$\mathcal{C}_\ell$'s purity can be defined as

$$\Phi^{(P)}(\mathcal{C}_\ell) = \frac{1}{n_\ell} \max_h n_\ell^{(h)} \tag{3.9}$$

Entropy can be defined for $g$ categories as

$$\Phi^{(E)}(\mathcal{C}_\ell) = -\sum_{h=1}^{g} \frac{n_\ell^{(h)}}{n_\ell} \, \frac{\log\left(n_\ell^{(h)} / n_\ell\right)}{\log g} \tag{3.10}$$
One cluster is dominated by the health category, with 483
out of 528 of its documents. Health-related
documents use a very distinct set of words and can, hence, be
separated well. Small and poorly distinguished categories have been
grouped with other documents; for example, the arts category has
mostly been absorbed by the music category to form clusters 14 and
16. This is inevitable since the 20 categories vary widely in size,
from 9 to 494 documents, while the clusters OPOSSUM provides
are much more balanced (from 58 to 528 documents per cluster).
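The two quality measures can be sketched in a few lines. We assume here that purity is the share of the dominant category within a cluster and that entropy is normalized by the logarithm of the number of categories, so that it lies in [0, 1]. The function names are our own. Only the 483 and 528 are taken from the text; the way the remaining 45 documents are split over other categories is hypothetical.

```python
import math


def purity(category_counts):
    """Fraction of a cluster's objects that fall in its dominant category."""
    return max(category_counts.values()) / sum(category_counts.values())


def normalized_entropy(category_counts, num_categories):
    """Entropy of the category distribution, divided by log(num_categories)."""
    total = sum(category_counts.values())
    entropy = 0.0
    for count in category_counts.values():
        if count > 0:  # empty categories contribute nothing
            share = count / total
            entropy -= share * math.log(share)
    return entropy / math.log(num_categories)


# 483 of 528 documents are from health (as in the text); the split of the
# remaining 45 documents is hypothetical.
health_cluster = {"health": 483, "other_a": 30, "other_b": 15}

print(round(purity(health_cluster), 3))  # 0.915
print(normalized_entropy(health_cluster, num_categories=20))
```

Purity depends only on the dominant category, so the value 483/528 holds however the other 45 documents are distributed. Entropy, in contrast, changes with that distribution, which is why no value is quoted for it here.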
 
 
 
 
 
 
