Notes
Outline
Characterising Genes From
Gene-array Data
Peter Andreae
Victoria University of Wellington
Jude Shavlik, Michael Molla
University of Wisconsin, Madison
Key Ideas
New algorithm for finding characterisations of data
Bottom-up / data-driven
Efficient clustering technique
Characteristic and Discriminant rules
Non-standard evaluation criterion
Testing statistical significance
Permutation test
Outline
Task
Making sense of microarray data
Rule finding algorithm
MAXCCLUS
Evaluation criterion
Permutation test
Discussion
Problem Domain
Microarray experiments
eg, Antibiotic shock on E. coli
Experimental Data
Categorise each gene as
UP regulated  (ratio > 2)   or
DOWN regulated   (ratio < 1/2) or
??:  ambiguous or error
Text Data
Gene descriptions from text databases.
SwissProt
Database about proteins
Gene described by relevant fields for the protein that the gene codes for.
Riley
Database of functional classifications of E. coli genes
Gene described by set of words
Words unique to a single gene removed
Task
Characterise UP and DOWN genes
Goal:
Assist scientist to interpret data
Task: example
Properties of Domain
Small population of instances  (1000's)
vs small training set from large population
Outline
Task
Making sense of microarray data
Rule finding algorithm
MAXCCLUS
Evaluation criterion
Permutation test
Discussion
First Approaches
(Molla and Shavlik)
Naive Bayes, plus pruning
PFOIL, plus pruning
Evaluation by predictive accuracy
on held out test set
Some Problems
Naive Bayes
good accuracy
ignores dependencies
discriminant model, not characteristic
MAXCCLUS
Data Driven
vs  top-down search of hypothesis space
MAXCCLUS: Overview
MAXCCLUS: Clustering
For each pair of instances
find set of words common to instances
mark with length of common description
For each length from longest to 1
For each marked pair of this length
form a cluster
add other matching instances to cluster
unmark all pairs in cluster
(Builds from most specific for efficiency)
Clustering:   Match pairs
Clustering:   Match pairs
Clustering:   Construct Clusters
Clustering:   Construct cluster
Clustering:   Construct cluster
Clustering:   Construct cluster
Clustering:   Extend Cluster
Maximal Clusters
Finds Maximal Conjunctive Clusters
Conjunctive:
cluster specified by conjunction of features
Maximal description:
cannot add any more features
Maximal set:
cannot add any more instances
Clusters determined only by text data:
ignores categories from experimental data
MAXCCLUS: Overview
Select Good Clusters
Cluster = instances with common features
but many clusters irrelevant to experiment
many clusters not meaningful
Select Clusters
Select Clusters
MAXCCLUS: Overview
MAXCCLUS: Select Cover
Typically have overlapping, redundant good clusters
MAXCCLUS: Select Cover
Select subset of good clusters that covers same instances
MAXCCLUS: Select Cover
Select subset of good clusters that covers same instances
Greedy algorithm
Choose cluster that covers the most uncovered instances
MAXCCLUS: Select Cover
Select subset of good clusters that covers same instances
Greedy algorithm
Choose cluster that covers the most uncovered instances
MAXCCLUS: Select Cover
Select subset of good clusters that covers same instances
Greedy algorithm
Choose cluster that covers the most uncovered instances
MAXCCLUS: Select Cover
Select subset of good clusters that covers same instances
Greedy algorithm
Choose cluster that covers the most uncovered instances
MAXCCLUS: Select Cover
Select subset of good clusters that covers same instances
Greedy algorithm
Choose cluster that covers the most uncovered instances
MAXCCLUS: Overview
Finding Rules
Construct rules with
“necessary” words
“sufficient” words
“supplementary” words
Finding Rules: Algorithm
For each instance NOT in cluster:
find set of words in cluster but not in instance
Finding Rules: Example
UP:  A B H K
Rules
Contain characteristic and discriminant components
Accuracy determined by user parameter
Confidence determined by cluster selection
Outline
Task
Making sense of microarray data
Rule finding algorithm
construct clusters
select clusters
builds rules
Evaluation criterion
Permutation test
Discussion
Significance: The Problem
Significance
Need to distinguish:
Clusters that result from
statistical chance
from
Clusters that represent a
genuine relationship between
features and category.
Approaches to Significance
Accuracy on held-out test set
Good rules will be predictive on items from the same population
 Statistical significance (Molla)
What is the probability that such a rule would have been found if there was no genuine  association?
Must consider space of all possible rules
Advantages of Significance Test
Measures the desired property directly
Appropriate when entire population is known
Appropriate for low coverage rules
Doesn’t  “waste”  training instances
Significance Test
Pick a candidate minimum cluster size to test
 Ž set of possible clusters (good and bad)
Significance Test
Compute probability analytically
Hard: distribution of clusters is complex
Compute probability stochastically:
Pick minimum size.
Try many random category assignments
Count number of trials with a pure cluster at or above minimum size.
(Permutation test)
Permutation Test: Example
Permutation Test: Example
Permutation Test: Example
Permutation Test: Example
Getting the Test Right
Null hypothesis:
Categories are unrelated to the text features
Cluster space is independent of the permutation
Clusters determined only by the text features and minimum size parameter
Number of Trials
User sets desired confidence Ž ptarg
Need enough trials to be sure that
pest > ptarg   or   pest< ptarg
More trials Ž better estimate
Bernoulli experiment  Ž estimate standard error
Repeat until estimate is confidently below/above target
Outline
Task
Making sense of microarray data
Rule finding algorithm
clusters
selects
builds rules
Evaluation criterion
Permutation test
Discussion
Results
Applied to one microarray experiment
Text from Swissprot and/or Riley
Applied to genes and to operons
Rules are comparable to those of PFOIL
Fast
Results 1
All text data:
90% accuracy and 95% confidence
Found 55915 Clusters
Confidence required clusters > 34 instances
23 clusters in cover
Covered 502 genes out of 907
451 correctly
» 78 seconds per run on laptop
» 52 seconds for permutation test
3000 iterations sufficient
Results 2
SwissProt data:
90% accuracy and 95% confidence
Found 34165 Clusters
Confidence required clusters > 27 instances
21 clusters in cover
Covered 368 genes out of 772
330 correctly
» 260 seconds per run on laptop
» 246 seconds for permutation test
30,000 iterations sufficient
Comments
Rules from limited hypothesis space
no negations
strictly conjunctive
not all possible maximal conjunctive clusters
Original intention was not to construct rules!
goal = construct good seed rules for a rule-space searcher.
Significance test changes for larger space
Future Directions
Extend to handle negations
Apply to more experimental data
Feed output to a rule-space searcher.
Extend to all maximal conjunctive clusters (?)
Extend to a 1st order domain
“waterfall” effect is stronger
rule space is larger
matching instances may generate good seeds
Conclusions
Fast, data-driven algorithm to construct characteristic and discriminant rules
Desired accuracy and confidence specified by user
Applicable to small data sets, large descriptions
Permutation test to select rules with required confidence
Applicable to other tasks.
Generates good seeds for hypothesis space search
Questions …
Some Rules
DN:  IF (and a) AND  (ec@1 OR oxidoreductase)
DN:  IF (biosynthesis step)
DN:  IF (cytoplasmic biosynthesis)
DN:  IF (the step)
DN:  IF (to step)
UP:  IF ((intergenic OR region) AND (precursor OR signal))
UP:  IF (e.coli) AND (intergenic OR (hypothetical region))
UP:  IF (region the transmembrane)
UP:  IF (hypothetical region to the)
UP:  IF (hypothetical membrane)
UP:  IF (in membrane potential)
UP:  IF (hypothetical belongs)
UP:  IF (of transcription)
UP:  IF (protein of regulation) UP:  IF (protein transcription)
UP:  IF (hypothetical by) UP:  IF (recombination)
UP:  IF (dna-binding) UP:  IF (dna for)
UP:  IF (element) UP:  IF (insertion)