Associate Professor Marcus Frean

School of Engineering and Computer Science
Victoria University of Wellington

machine learning

affinity propagation of opinions for hashmapd?

finding circles

This is cooking along nicely, with Chris Hollitt, who knows Hough Transform backwards.

A good MSc project, I imagine. Conceivably even a PhD, at a pinch. IF we're confident that the way hasn't been well trodden already.

get it going
test it to hell
push the assumptions back, creating better models, EG:

ellipses
better background models, like blobs or gradients
finding multiple overlapping circles

Carrying on with the Configuration Variation Eliminatizer!

James Bebbington's MSc is really cool - so what's next?

drawing.png: part of the Big Pic on the Bebbington thesis

implement in a real robot
other ideas to extend / proof-of-concept it

generative modeling

Oct 2010.

I've been thinking about boosting ideas in the context of "constructive RBMs", but this might be more general than that.

The essential problem of learning a generative model:

a 'training set' is a list of vertices of a hypercube, and we assume this list is massively shorter than the total number of vertices. Call these the 'data vertices', or just 'v'.
for simplicity suppose there are no exact replications in the training set
our model-learning job is to simply to maximize the sum of the log probabilities of the data vertices
the gradient calculation says: for each data vertex v,
1. sample h from P(h|v), giving a x1 = (v1,h1) pair
2. take another sample, from just the model joint, giving a x2= (v2,h2)
3. steal some probability mass from x2 and give it to x1

It's just like each data point has its own "agent", who's job it is to steal probability mass and send it home. The agents just ignore each other, and go where the model probability is high, in their search. With CD1, we only let them search very nearby, which seems terrible as they'll never steal probability mass from further afield. But because the model isn't super-powerful, stealing it from a nearby point "to the left" means stealing it from all points to the left, even those further afield. Hence CD1 works better than you might think on first viewing.

But methinks we know something more, that we're not using in this algorithm: At least for an RBM, we can easily go through the training set and calculate the probability mass currently at each data vertex. Can't we use this somehow???!

Actually I don't know what I meant above. Firstly, for an RBM, you can't actually calculate p(v,h) as you don't know normalisation Z. Secondly, how do you imagine "using it" - the probs (posterior, and joint) have both been "used" already in the sense of samples being taken from them.

emergent "things" (ie. limited explaining away in rich representations).

Deep belief nets are all very fine, but they're based on RBMs, and RBMs, being undirected graphical models, DON'T have a generative model that involves high-level constructs that are independent, a.k.a. "things". Instead, everything in the hidden layer of an RBM is dependent on everything else in the prior (i.e. the generative model), and only becomes independent once you observe the visible units.

RBM: learn by truncated Gibbs sampling, in two phases (clamped and unclamped).

SBN: learn by EM with Gibbs sampling in the clamped phase only, i.e. from P(h|v). Gibbs sampling in this phase is achieved by almost the same as happens in an RBM (i.e. sigmoid function of input to h from v), but visible units ONLY contribute to the weighted sum when their input from OTHER hiddens is in DISAGREEMENT with their state. (this is an approximation to true Gibbs sampling, treating log(1+exp(z)) as a piece-wise "hinge".

This makes intuitive sense: the hidden unit only needs to "try to be a cause" for v when other hidden units fail as explanations (fail to "explain away" this v).

The drawbacks of Gibbs sampling in an SBN are:

each unit has to calculate a different number for each of the hiddens it connects to, whereas in an RBM it's just the state v alone.
this sum involves other hidden units, so they all interact (as they must!! there seems no way they can "explain away" without interacting, afterall). And the practical effect of this is that interaction makes MCMC slow to mix
Since MCMC is slow to mix, both recognition and learning are going to be slow dynamical processes.

POSSIBILITY: don't have all hiddens interacting in this way, just a few, or (somehow...) let this low number emerge - preferably driven by the very tractability we're so worried about!!!

Do it at the Gibbs sampling level: visible units when sending to hiddens in pool A, gate their output based on the input they receive from pool B, and vice versa. That's the general notion I'm wondering about.

An Idea for an optimization algorithm

Genetic algorithms aren't ergodic: they don't search the entire space, even given forever.
Simulated Annealing runs on top of Metropolis, which is MCMC, which is ergodic.
But Metropolis mixes very slowly. Hybrid/Hamiltonian MC works much better, if you know the gradient.
Ergo (!), if you know the gradient (as in NN optimizations, say), you'd expect HMC to do better than almost any GA, after accounting for parallelization.
But what about optimizations without a gradient, ie. the main target of GAs anyway.

EGO uses a Gaussian process to choose new samples. But since it involves two intensive steps, we phrase it as a solution (only) worth pursuing for "expensive problems". The intensive steps are:

the matrix inversion, to be done with every new sample.
searching the model for an optimum. Fast-ish with conjugate gradient, but it's messy and big / hard. People naturally prefer the simplicity of GAs.

(There is also the issue of hyperparameters, but optimizing those is strictly something to be done occasionally, and for expensive data problems - it's like on-the-fly tuning of any GA's parameters, afterall).

As Rohit is fond of saying, the GA people have been using "surrogates" (I forget the term they use though) in place of real fitness evaluations. An example might be to use SA to optimize the surrogate function (which is pretty much what we do in EGO).

Because SA is slow, we'd want to run it for many inter-samples before taking a real sample. So again we're looking at intensive use of the surrogate. The question is, is there a very light-weight use of a surrogate, that gives an algorithm almost as dumbass as GAs?

Let's call GAs "Algorithm 0", because, and let's be honest here, GAs are the lowest of the low and I just can't bear to call them #1. A man's got to stand for something in this sweet ol' world or he's just a gosh darned varmint.

Strawman algorithm: We could take a real sample at every SA inter-sample, for instance. That means just using the Metropolis proposal distribution to get new GA samples, which is identical to mutation-only GA, but with the extra step of a (possible) Metropolis rejection in between proposal and real-sample-taking. Okay that's about as stupid as you can get, without being Alg 0 itself.

Next up: run SA for a couple more inter-samples... but not thousands. The danger here is that the surrogate doesn't hold far from the data. The longer you run the chain, the more chance you're building castles in the air. The "trouble" is that SA is so stupid that you really do need to run it quite a while on the surrogate (I imagine) before seeing any improvement over Alg 0.

My idea is to use a surrogate to produce gradient estimates that HMC then uses to generate new samples instead of SA.

ie. After every new sample:

use the partitioned inverse equations to add to C^inv (an N-by-N matrix, where N is the number of datapoints).
use C^inv to estimate the local gradient
feed that to Hybrid/Hamiltonian MC, which proposes a new sample point.

This is ergodic (like SA) but more efficient (than SA).

Advances on GPO/ EGO

joining the dots

Notice that

The very final "sample point" (x) returned to the user should be the point with highest mean prediction, perhaps with some penalty for variance (ie. risk aversion) thrown in. If the penalty is high, we'll return the highest point actually found. PURE EXPLOIT
the penultimate point should be that with maximum expected improvement. And we know that's basically breaking into two bits, a mean lover and a variance lover (at high mean), added together. EXPLOIT + EXPLORE
the main argument / interest then is what to do for the samples before this.
in the limit that you're taking early samples and expect to be able to take ZILLIONS MORE, I guess the best thing to do is to be a variance-lover, and ignore the mean. PURE EXPLORE

Can we join these dots?!

non-homogeneous noise distribution?

Also:.... I don't think we've yet coped with truly noisy samples right, in the sense that I misunderstood "variance" all along...! The GP doesn't seem to tell you the variance of the samples - it's (just) the variance of the MEAN. Not the same! But Variance of actually sampling process IS important! This seems to never have been addressed.

superpositions of covariance functions

real data often looks like a sum of at least two GPs. How does that affect inference and learning? Can this improve GPO further?

use the variance of EI

There is a posterior distribution over "improvement" that is Gaussian. So far, we've used the expected (mean...) improvement, but haven't thought about the uncertainty in this - the variation in the improvement we expect. Given two input vectors with the same EI, but with different "variances" on the improvement (a definition of this would help, for starters!), I bet you'd prefer one over the other. And perhaps which it is would depend on how many MORE points you expect to be able to sample.

I feel in my bones that there's a connection between this and the "future aware" notion.

GPO for tracking

Is this a possible completely new way to do "particle" methods (PF, PSO), for tracking? Think this through...

Sim, seems like it might be a good fit for Wireless sensor networks optimization problems. Seriously...

hashmapd's labels problem

problem here is we want the answer yesterday...

But it's full of interesting theory: surely there are multiple possible approaches. Would we need to embargo / protect the results? Yes, probably. Think about this, talk over with Miles and Ed.

tSNE

A good MSc project: try out tSNE "annealing" from high dimensions down to 2 (or 1): can it help avoid local optima in the resulting mapping?

find two good problem sets
find a regime where tSNE vanilla is getting stuck all the time
fix the stuck-ness by the extra-dimensional bypass!

STRANDBEEST

Could look to evolve (how else?!) the graph, and optimize (any way...) lengths for similar. Include tendons with elasticity.... Mike Paulin's comment that muscles (mainly) reconfigure elasticity/responses in effect, rather than "just doing work". Seems fertille ground for optimization, and a cool thing to build at the end. Physics simulation engines can now be taken for gratned. Then hope to build it as an hons project in 2012.

theoretical biology

That Paper with Hartley

He said he'd do stuff. he never did. I want to submit it anyway. Need to decide WHEN I'm actually going to do this.

stag hunts

Want the payoffs to depend critically on relative sizes of own vs other groups, to get scale-up. Or perhaps it's not fractional sizes but absolute advantages that matters? Dunno. Need to think harder and larger, on a much bigger canvas..........

rate of evolution on graphs

Paul Rainey, Arne Taulsen, and me!

Print version | Backlinks |

Users/Marcus

RSS FeedBack to top ^

Valid XHTML and CSS | Built on Foswiki