Friday, October 27, 2006

Cluster Ensembles

This past week I received the November 2006 issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence, and found the article "Evaluation of Stability of k-Means Cluster Ensembles with Respect to Random Initialization" very interesting. This is something that I have thought about, but (to my discredit) haven't read up on or experimented with beyond very simple case studies.

It is, of course, the logical extension of the ensemble techniques that have been used for the past decade. The method that I found most accessible was to (1) resample the data with bootstrap samples, (2) create a k-means cluster model for each sample, and (3) use each model's cluster labels as a new field on every record (at this point, you have R records, M fields used to build the clusters, and P cluster models, giving one new label field per model). Finally, you can build a hierarchical clustering model on the records using the new "P" fields.
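For concreteness, here is a minimal sketch of that three-step procedure in Python using scikit-learn and SciPy. The data, the choice of P=10 bootstrap models and k=3 clusters, and the use of Hamming distance to compare the categorical label fields are all my own illustrative assumptions, not something taken from the paper:

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Hypothetical data: R records, M fields (stand-in for real data)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))   # R = 200, M = 5

    P = 10   # number of bootstrap cluster models (assumed)
    k = 3    # clusters per k-means model (assumed)

    labels = np.empty((X.shape[0], P), dtype=int)
    for p in range(P):
        # (1) bootstrap sample of the records
        idx = rng.integers(0, X.shape[0], size=X.shape[0])
        # (2) k-means model built on that sample
        km = KMeans(n_clusters=k, n_init=10, random_state=p).fit(X[idx])
        # (3) label every original record with its nearest cluster,
        #     producing one new field per model
        labels[:, p] = km.predict(X)

    # Hierarchical clustering on the P label fields. Hamming distance
    # treats the labels as categorical: the fraction of models on
    # which two records disagree.
    d = pdist(labels, metric="hamming")
    Z = linkage(d, method="average")
    final = fcluster(Z, t=k, criterion="maxclust")

The Hamming metric is one reasonable choice here because the cluster labels are arbitrary category codes, so two records are "close" simply when many of the P models put them in the same cluster.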

More on this after some experiments.

3 comments:

Will Dwinnell said...

Do you have anything of interest to report from your experiments? You may be interested in "Refining Initial Points for K-Means Clustering" (1998), by P. S. Bradley and Usama M. Fayyad.

Dean Abbott said...

Not yet...I hope to have some results after rebuilding my poor disk-impaired laptop :)

Dean Abbott said...

The earliest paper I have found on the subject is a 1999 paper entitled "Bagged Clustering" (F. Leisch, "Bagged Clustering", Working Papers SFB "Adaptive Information Systems and Modelling in Economics and Management Science", 51, Institut für Informationsverarbeitung, Abt. Produktionsmanagement, Wien, 1999. http://citeseer.ist.psu.edu/leisch99bagged.html)--apparently I really missed the hubbub! I like the idea because clustering is inherently an unstable technique (meaning small changes in the data can produce significant changes in the model), much like trees and naive Bayes. Hopefully more this month...