Maxcclus Project

Maxcclus project for 2003


There are two parts to this project:

  1. Using the maxcclus algorithm to obtain useful characterisations of the microarray data from Wisconsin.
  2. Improving the algorithm.

Processing Data

The experimental data gives a normalised log ratio of the regulation level for each gene. So far, we have used simple thresholding to identify the clearly up-regulated genes and the clearly down-regulated genes, and found characterisations of just these genes.
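
For concreteness, here is a minimal Python sketch of that thresholding step; the representation (a dict of gene name to normalised log ratio) and the cutoff of 1.0 are only illustrative, not necessarily what we actually used.

    # Two-way thresholding on the normalised log ratios. The cutoff of 1.0
    # and the dict representation are illustrative assumptions.
    def threshold_two_way(log_ratios, cutoff=1.0):
        up = {gene for gene, ratio in log_ratios.items() if ratio > cutoff}
        down = {gene for gene, ratio in log_ratios.items() if ratio < -cutoff}
        return up, down

    # Example with made-up values.
    ratios = {"geneA": 1.7, "geneB": -2.3, "geneC": 0.2}
    up_genes, down_genes = threshold_two_way(ratios)
    print(up_genes, down_genes)   # {'geneA'} {'geneB'}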

We have used the SwissProt database by extracting certain fields, converting the description of each gene into a word set, applying varying degrees of stemming to the words, and eliminating a limited set of the most frequent function words.
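
The word-set construction looks roughly like the following Python sketch; the field extraction is assumed to have already produced a description string, and the Porter implementation (NLTK's) and the tiny function-word list are stand-ins for whatever we settle on.

    # Turn a SwissProt description into a word set: lower-case, tokenise,
    # drop a few frequent function words, then stem with NLTK's Porter stemmer.
    # The function-word list here is an illustrative placeholder.
    import re
    from nltk.stem import PorterStemmer

    FUNCTION_WORDS = {"the", "of", "and", "in", "to", "a", "with", "by"}
    stemmer = PorterStemmer()

    def description_to_word_set(description):
        words = re.findall(r"[a-z]+", description.lower())
        return {stemmer.stem(w) for w in words if w not in FUNCTION_WORDS}

    print(description_to_word_set(
        "Catalyzes the reduction of nitrate to nitrite in the membrane"))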

There are several other variations that we should consider:

  • We need to experiment to find the effect of running more or fewer trials on the confidence level we can determine.
  • Including the "unchanged" genes, along with the up- and down-regulated ones. We could use thresholds of
     < -1.0 = down
     > -0.5 and < 0.5 = no change
     > 1.0 = up
    This would give us three categories instead of just two. A small sketch of these cutoffs appears after this list.
  • Genes come in groups (operons) that are transcribed together, so the regulation of a gene is really the regulation of the operon containing that gene. Previous work on just a single set of experimental data suggested that many of the significant clusters that Maxcclus found were largely due to the grouping of genes into operons, and that there were far fewer significant clusters of operons. We should run the algorithm on operons with the larger set of experimental data; a toy gene-to-operon grouping is sketched after this list.
  • We need to experiment with different levels of stemming, and perhaps with smarter stemming algorithms (the Porter stemmer does not use a dictionary; there may be better stemmers available). A small comparison harness is sketched after this list.
    One problem is determining whether a better stemmer actually gives better results; we have to work out a way of evaluating the results.
  • Jude Shavlik wanted to do minimal modification of the text database, so the current system does minimal elimination of non-meaningful words. I believe that cleaning up the vocabulary more aggressively might give better performance.
  • The SwissProt database may have been expanded since the version we have been using. I have a more up-to-date copy; we need to process it into the appropriate text files.
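
Here is a sketch of the three-category thresholds proposed in the second item above. Note that, as written, ratios falling between -1.0 and -0.5 or between 0.5 and 1.0 land in none of the three categories, so the sketch simply leaves such borderline genes out.

    # Three-way categorisation using the cutoffs listed above. Borderline
    # ratios (between the cutoffs) are deliberately left uncategorised.
    def categorise(ratio):
        if ratio < -1.0:
            return "down"
        if -0.5 < ratio < 0.5:
            return "no change"
        if ratio > 1.0:
            return "up"
        return None   # borderline: excluded from all three categories

    ratios = {"geneA": 1.7, "geneB": -2.3, "geneC": 0.2, "geneD": 0.8}
    print({g: categorise(r) for g, r in ratios.items()})   # geneD -> None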
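
The toy gene-to-operon grouping mentioned in the operon item: the mapping and the choice to average log ratios across an operon are both assumptions for illustration, not a decided procedure.

    # Group gene-level log ratios into operon-level values.
    # The mapping and the use of the mean are illustrative assumptions.
    from collections import defaultdict

    gene_to_operon = {"geneA": "op1", "geneB": "op1", "geneC": "op2"}   # made up

    def operon_log_ratios(gene_ratios, gene_to_operon):
        grouped = defaultdict(list)
        for gene, ratio in gene_ratios.items():
            grouped[gene_to_operon[gene]].append(ratio)
        return {op: sum(rs) / len(rs) for op, rs in grouped.items()}

    print(operon_log_ratios({"geneA": 1.2, "geneB": 1.8, "geneC": -0.4},
                            gene_to_operon))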
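
The comparison harness mentioned in the stemming item: it only shows how the outputs of the Porter stemmer and a dictionary-backed alternative (here NLTK's WordNet lemmatizer, which requires the wordnet corpus) differ on a few words; deciding which gives better clusters is still the open evaluation problem.

    # Compare the Porter stemmer against a dictionary-backed lemmatizer.
    # Requires nltk.download("wordnet") for the lemmatizer.
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    porter = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["membranes", "binding", "catalyzes", "regulatory"]:
        print(word, "->", porter.stem(word), "/", lemmatizer.lemmatize(word))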

Extending the Algorithm

There are a number of limitations to the current algorithm that need to be addressed.
  • The algorithm used to run trials until it was quite certain of the minimum cluster size for the desired confidence level. We have changed the algorithm to run a fixed number of trials and report the minimum cluster size that it knows meets the desired confidence level (even though a smaller size might also meet it). We can make the algorithm more flexible by reporting the confidence level for each size. This would enable the final report to include the confidence level for each of the final clusters; a sketch of per-size confidence reporting appears after this list.
  • The current algorithm for rule construction uses a greedy covering algorithm to select the clusters to make rules out of (sketched after this list). This algorithm may be missing important rules. The generalisation algorithm generalises the clusters from the cover (I think). It would seem better to generalise the clusters first, before finding a cover, and to use a non-greedy covering algorithm. We can still remove subsumed clusters before generalising.
  • I am concerned that when the covering algorithm makes a choice between two clusters, it effectively eliminates a possible characterisation of some set of instances on the basis of whether those instances have already been included in a different characterisation. That seems unreliable: perhaps we need all the characterisations, or at least all the significantly different ones. We should eliminate a cluster on the basis of the similarity of its characterisation to other characterisations, rather than on whether its instances are included in another characterisation; a sketch of such similarity-based filtering appears after this list.
  • The current generalisation algorithm is deterministic and only considers dropping descriptors. We should experiment with search-based generalisation algorithms that consider both adding and dropping descriptors; a toy hill-climbing sketch appears after this list.
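
A sketch of the per-size confidence reporting from the first item. It assumes each randomised trial yields the largest cluster size that occurred by chance, which may not be exactly how the Maxcclus trials work, so treat the trial mechanism as an assumption.

    # For each cluster size, report the fraction of trials whose chance
    # clusters stayed smaller: an empirical confidence level for that size.
    # The "largest chance cluster per trial" model is an assumption.
    def confidence_per_size(trial_max_sizes, sizes):
        n = len(trial_max_sizes)
        return {s: sum(1 for m in trial_max_sizes if m < s) / n for s in sizes}

    trial_max_sizes = [3, 4, 3, 5, 2, 4, 3, 3, 4, 2]   # made-up trial outcomes
    print(confidence_per_size(trial_max_sizes, sizes=range(2, 8)))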
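
The greedy covering step from the second item looks something like this; the set-of-instances representation for clusters is assumed for illustration.

    # Greedy cover: repeatedly pick the cluster covering the most
    # not-yet-covered instances. Clusters are modelled as instance sets.
    def greedy_cover(clusters, instances):
        uncovered = set(instances)
        chosen = []
        while uncovered:
            best = max(clusters, key=lambda c: len(clusters[c] & uncovered))
            if not clusters[best] & uncovered:
                break   # nothing left that any cluster covers
            chosen.append(best)
            uncovered -= clusters[best]
        return chosen

    clusters = {"c1": {1, 2, 3}, "c2": {3, 4}, "c3": {4, 5, 6}}
    print(greedy_cover(clusters, instances={1, 2, 3, 4, 5, 6}))   # e.g. ['c1', 'c3']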
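
The similarity-based filtering from the third item, sketched with Jaccard similarity over characterisation word sets; the similarity measure and the 0.5 threshold are illustrative assumptions, not a settled choice.

    # Keep a characterisation only if it is sufficiently different from the
    # ones already kept. Jaccard similarity and the threshold are assumptions.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    def filter_by_characterisation(characterisations, threshold=0.5):
        kept = []
        for c in characterisations:
            if all(jaccard(c, k) < threshold for k in kept):
                kept.append(c)
        return kept

    chars = [{"membran", "transport"}, {"membran", "transport", "ion"}, {"ribosom"}]
    print(filter_by_characterisation(chars))   # the near-duplicate set is dropped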
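
The toy hill-climbing sketch for the last item: it greedily adds or drops one descriptor at a time while a score improves. Both the greedy search strategy and the toy scoring function are placeholders for whatever cluster-quality measure Maxcclus would actually use.

    # Search-based generalisation: consider adding as well as dropping
    # descriptors, keeping any single change that improves the score.
    def hill_climb(descriptors, candidate_pool, score):
        current = set(descriptors)
        improved = True
        while improved:
            improved = False
            neighbours = [current - {d} for d in current] + \
                         [current | {d} for d in candidate_pool - current]
            best = max(neighbours, key=score, default=current)
            if score(best) > score(current):
                current, improved = best, True
        return current

    # Toy score: prefer descriptor sets close to a hidden "target" set.
    target = {"membran", "transport"}
    score = lambda s: len(s & target) - len(s - target)
    print(hill_climb({"membran", "ion", "bind"},
                     {"membran", "transport", "ion", "bind"}, score))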