Graduate Projects, Peter Andreae:

Clustering Algorithms

In previous work on clustering, I investigated algorithms for finding "clusters" or subclasses in large collections of data. More recently, I have worked on a different machine learning algorithm that incorporates some ideas from clustering, but applies them to a "supervised clustering" task where the goal is to characterise the data rather than to classify or predict. Currently, I am looking at clustering data from the honeypot project in the Distributed Systems Group.

While on research leave at the University of Wisconsin, Madison (2001), I used some clustering ideas to build rules/models of data from microarray ("gene-chip") experiments. The idea is to find out what is common to groups of genes that were "up-regulated" (or "down-regulated") in the experiment, in order to help the biologist identify what was going on in the experiment.
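As a minimal illustration of the idea of characterising a group of genes by what they have in common, here is a small sketch. The gene names and attributes are entirely hypothetical, and this is not the actual algorithm from the project; it simply shows the kind of question being asked: which attribute values are shared by every gene in an "up-regulated" group?

```python
# Hypothetical data: attributes of genes found to be "up-regulated"
# in a microarray experiment. Names and attributes are invented.
up_regulated = {
    "geneA": {"pathway": "glycolysis", "location": "cytoplasm", "length": "short"},
    "geneB": {"pathway": "glycolysis", "location": "cytoplasm", "length": "long"},
    "geneC": {"pathway": "glycolysis", "location": "nucleus",   "length": "short"},
}

def common_attributes(genes):
    """Return the attribute/value pairs shared by every gene in the group."""
    items = iter(genes.values())
    common = dict(next(items))          # start with the first gene's attributes
    for attrs in items:                 # keep only pairs every other gene agrees on
        common = {k: v for k, v in common.items() if attrs.get(k) == v}
    return common

print(common_attributes(up_regulated))  # {'pathway': 'glycolysis'}
```

A real algorithm would of course need to handle noise and partial overlap (a description true of *most* of the group), rather than demanding exact agreement as this sketch does.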

The algorithm sits between standard classification algorithms (such as decision tree builders) and clustering algorithms. I have some data from a biochemistry/genetics domain on which we can explore these ideas, and there is the possibility of collaboration with people from SBS in this area.

Recent student projects have ported the algorithm to the Weka framework (Jessica Kampmann, MCompSci) and investigated some extensions and modifications to the algorithm (Mukhlis Matti, MSc). These experiments have identified some strengths of the algorithm (the fast, cheap generalisation strategy appears to be very effective) and some limitations (it does not scale up to data sets with many thousands of items, or to classes that are hard to distinguish).

You can view the slides from a presentation about this algorithm. There aren't any "speaker notes", and the animation is not visible, so I don't promise that it is entirely comprehensible!
You can also view a paper about a different approach within the same project at UW Madison.

The generalisation approach in the project avoids the limitations of standard bottom-up and top-down generalisation, both of which are fatal for complex, relational data. The evaluation method is also promising for enabling learning from data sets with limited training and test data. A challenging MSc or PhD project would be to apply these two ideas to relational data, such as drug design or mining relational databases.

Some years ago, I developed a promising algorithm for clustering data based on the COBWEB algorithm. The algorithm processes the data incrementally, which may enable very large data sets to be clustered in roughly linear time.
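To show why incremental processing gives roughly linear-time behaviour, here is a minimal single-pass clustering sketch. This is *not* COBWEB or the project's algorithm (COBWEB builds a probabilistic concept hierarchy using category utility); it is a much simpler illustration of the incremental idea: each point is assigned to the nearest existing cluster, or starts a new one, so one pass over the data costs roughly the number of points times the number of clusters.

```python
# Single-pass incremental clustering sketch (illustrative only).
# Each point joins the nearest cluster centroid within `threshold`,
# or founds a new cluster; centroids are updated incrementally.

def incremental_cluster(points, threshold):
    clusters = []  # each cluster: {"centroid": (x, y), "members": [...]}
    for p in points:
        best, best_dist = None, threshold
        for c in clusters:
            cx, cy = c["centroid"]
            d = ((p[0] - cx) ** 2 + (p[1] - cy) ** 2) ** 0.5
            if d < best_dist:
                best, best_dist = c, d
        if best is None:
            clusters.append({"centroid": p, "members": [p]})
        else:
            best["members"].append(p)
            n = len(best["members"])
            cx, cy = best["centroid"]
            # running-mean update of the centroid
            best["centroid"] = (cx + (p[0] - cx) / n, cy + (p[1] - cy) / n)
    return clusters

data = [(0, 0), (0.1, 0.2), (5, 5), (5.1, 4.9), (0.2, 0.1)]
result = incremental_cluster(data, threshold=1.0)
print(len(result))  # 2 clusters
```

Unlike this sketch, COBWEB also considers merging and splitting clusters as each point arrives, which is what makes the efficient incremental bookkeeping interesting.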

Aaron Roydhouse looked at ways of extending the algorithm to deal with temporal data. This introduces complications because part of the description of a data point now includes the cluster that follows it in time. As the algorithm explores new clusterings, the descriptions of all the data points change! Handling this efficiently requires clever data structures and algorithms.