|
|
|
Peter Andreae |
|
Victoria University of Wellington |
|
|
|
Jude Shavlik, Michael Molla |
|
University of Wisconsin, Madison |
|
|
|
|
|
New algorithm for finding characterisations of
data |
|
Bottom-up / data-driven |
|
Efficient clustering technique |
|
Characteristic and Discriminant rules |
|
Non-standard evaluation criterion |
|
Testing statistical significance |
|
Permutation test |
|
|
|
|
|
Task |
|
Making sense of microarray data |
|
Rule finding algorithm |
|
MAXCCLUS |
|
Evaluation criterion |
|
Permutation test |
|
Discussion |
|
|
|
|
|
Microarray experiments |
|
eg, Antibiotic shock on E. coli |
|
|
|
|
|
|
|
Categorise each gene as |
|
UP regulated
(ratio > 2) or |
|
DOWN regulated
(ratio < 1/2) or |
|
??:
ambiguous or error |
|
|
|
|
|
|
|
|
|
|
Gene descriptions from text databases. |
|
SwissProt |
|
Database about proteins |
|
Gene described by relevant fields for the
protein that the gene codes for. |
|
Riley |
|
Database of functional classifications of E.
coli genes |
|
Gene described by set of words |
|
Words unique to a single gene removed |
|
|
|
|
|
|
|
Characterise UP and DOWN genes |
|
Goal: |
|
Assist scientist to interpret data |
|
|
|
|
|
|
Small population of instances (1000's) |
|
vs small training set from large population |
|
|
|
|
|
Task |
|
Making sense of microarray data |
|
Rule finding algorithm |
|
MAXCCLUS |
|
Evaluation criterion |
|
Permutation test |
|
Discussion |
|
|
|
|
(Molla and Shavlik) |
|
|
|
Naive Bayes, plus pruning |
|
PFOIL, plus pruning |
|
Evaluation by predictive accuracy
on held
out test set |
|
|
|
|
|
Naive Bayes |
|
good accuracy |
|
ignores dependencies |
|
discriminant model, not characteristic |
|
|
|
|
|
Data Driven |
|
vs top-down
search of hypothesis space |
|
|
|
|
|
|
|
For each pair of instances |
|
find set of words common to instances |
|
mark with length of common description |
|
For each length from longest to 1 |
|
For each marked pair of this length |
|
form a cluster |
|
add other matching instances to cluster |
|
unmark all pairs in cluster |
|
(Builds from most specific for efficiency) |
|
|
|
|
|
|
|
|
|
|
|
|
|
Finds Maximal Conjunctive Clusters |
|
Conjunctive: |
|
cluster specified by conjunction of features |
|
Maximal description: |
|
cannot add any more features |
|
Maximal set: |
|
cannot add any more instances |
|
Clusters determined only by text data: |
|
ignores categories from experimental data |
|
|
|
|
|
|
Cluster = instances with common features |
|
but many clusters irrelevant to experiment |
|
many clusters not meaningful |
|
|
|
|
|
|
|
Typically have overlapping, redundant good
clusters |
|
|
|
|
Select subset of good clusters that covers same
instances |
|
|
|
|
|
Select subset of good clusters that covers same
instances |
|
Greedy algorithm |
|
Choose cluster that covers the most uncovered
instances |
|
|
|
|
|
Select subset of good clusters that covers same
instances |
|
Greedy algorithm |
|
Choose cluster that covers the most uncovered
instances |
|
|
|
|
|
Select subset of good clusters that covers same
instances |
|
Greedy algorithm |
|
Choose cluster that covers the most uncovered
instances |
|
|
|
|
|
Select subset of good clusters that covers same
instances |
|
Greedy algorithm |
|
Choose cluster that covers the most uncovered
instances |
|
|
|
|
|
Select subset of good clusters that covers same
instances |
|
Greedy algorithm |
|
Choose cluster that covers the most uncovered
instances |
|
|
|
|
|
|
Construct rules with |
|
“necessary” words |
|
“sufficient” words |
|
“supplementary” words |
|
|
|
|
|
For each instance NOT in cluster: |
|
find set of words in cluster but not in instance |
|
|
|
|
|
Contain characteristic and discriminant
components |
|
Accuracy determined by user parameter |
|
Confidence determined by cluster selection |
|
|
|
|
|
Task |
|
Making sense of microarray data |
|
Rule finding algorithm |
|
construct clusters |
|
select clusters |
|
builds rules |
|
Evaluation criterion |
|
Permutation test |
|
Discussion |
|
|
|
|
|
|
Need to distinguish: |
|
Clusters that result from
statistical
chance |
|
from |
|
Clusters that represent a
genuine
relationship between
features and category. |
|
|
|
|
|
Accuracy on held-out test set |
|
Good rules will be predictive on items from the
same population |
|
Statistical significance (Molla) |
|
What is the probability that such a rule would
have been found if there was no genuine
association? |
|
Must consider space of all possible rules |
|
|
|
|
|
|
Measures the desired property directly |
|
Appropriate when entire population is known |
|
Appropriate for low coverage rules |
|
Doesn’t
“waste” training instances |
|
|
|
|
|
Pick a candidate minimum cluster size to test |
|
Ž set of
possible clusters (good and bad) |
|
|
|
|
|
Compute probability analytically |
|
Hard: distribution of clusters is complex |
|
Compute probability stochastically: |
|
Pick minimum size. |
|
Try many random category assignments |
|
Count number of trials with a pure cluster at or
above minimum size. |
|
(Permutation test) |
|
|
|
|
|
|
|
|
|
Null hypothesis: |
|
Categories are unrelated to the text features |
|
Cluster space is independent of the permutation |
|
Clusters determined only by the text features
and minimum size parameter |
|
|
|
|
|
User sets desired confidence Ž ptarg |
|
Need enough trials to be sure that
pest
> ptarg or pest< ptarg |
|
More trials Ž better estimate |
|
Bernoulli experiment Ž estimate standard error |
|
Repeat until estimate is confidently below/above
target |
|
|
|
|
|
Task |
|
Making sense of microarray data |
|
Rule finding algorithm |
|
clusters |
|
selects |
|
builds rules |
|
Evaluation criterion |
|
Permutation test |
|
Discussion |
|
|
|
|
Applied to one microarray experiment |
|
Text from Swissprot and/or Riley |
|
Applied to genes and to operons |
|
Rules are comparable to those of PFOIL |
|
Fast |
|
|
|
|
|
|
All text data: |
|
90% accuracy and 95% confidence |
|
Found 55915 Clusters |
|
Confidence required clusters > 34 instances |
|
23 clusters in cover |
|
Covered 502 genes out of 907 |
|
451 correctly |
|
» 78
seconds per run on laptop |
|
» 52
seconds for permutation test |
|
3000 iterations sufficient |
|
|
|
|
|
|
|
|
SwissProt data: |
|
90% accuracy and 95% confidence |
|
Found 34165 Clusters |
|
Confidence required clusters > 27 instances |
|
21 clusters in cover |
|
Covered 368 genes out of 772 |
|
330 correctly |
|
» 260
seconds per run on laptop |
|
» 246
seconds for permutation test |
|
30,000 iterations sufficient |
|
|
|
|
|
Rules from limited hypothesis space |
|
no negations |
|
strictly conjunctive |
|
not all possible maximal conjunctive clusters |
|
Original intention was not to construct rules! |
|
goal = construct good seed rules for a
rule-space searcher. |
|
Significance test changes for larger space |
|
|
|
|
|
Extend to handle negations |
|
Apply to more experimental data |
|
Feed output to a rule-space searcher. |
|
Extend to all maximal conjunctive clusters (?) |
|
Extend to a 1st order domain |
|
“waterfall” effect is stronger |
|
rule space is larger |
|
matching instances may generate good seeds |
|
|
|
|
|
Fast, data-driven algorithm to construct
characteristic and discriminant rules |
|
Desired accuracy and confidence specified by
user |
|
Applicable to small data sets, large
descriptions |
|
Permutation test to select rules with required
confidence |
|
Applicable to other tasks. |
|
Generates good seeds for hypothesis space search |
|
|
|
|
|
DN: IF
(and a) AND (ec@1 OR
oxidoreductase) |
|
DN: IF
(biosynthesis step) |
|
DN: IF
(cytoplasmic biosynthesis) |
|
DN: IF
(the step) |
|
DN: IF
(to step) |
|
|
|
UP: IF
((intergenic OR region) AND (precursor OR signal)) |
|
UP: IF
(e.coli) AND (intergenic OR (hypothetical region)) |
|
UP: IF
(region the transmembrane) |
|
UP: IF
(hypothetical region to the) |
|
UP: IF
(hypothetical membrane) |
|
UP: IF
(in membrane potential) |
|
UP: IF
(hypothetical belongs) |
|
UP: IF
(of transcription) |
|
UP: IF
(protein of regulation) UP: IF
(protein transcription) |
|
UP: IF
(hypothetical by) UP: IF
(recombination) |
|
UP: IF
(dna-binding) UP: IF (dna for) |
|
UP: IF
(element) UP: IF (insertion) |
|