[astro-ph/0508461] A Nested Sampling Algorithm for Cosmological Model Selection
Posted: June 02 2006
This paper discusses the use of Nested Sampling for cosmological model selection and parameter analysis. This is certainly a neat method, and nice to see it being put to use.
I hope to discuss some of these things at the Sussex conference next week. A couple of things to consider:
A. Intermediate steps in this method require sampling from the prior P(x) conditional on the likelihood L(x) being greater than a particular value L_j. To implement this sampling the authors seem to:
1. Construct an ellipse that encloses the current set of samples
2. Expand the ellipse by some constant factor to allow for the fact that true likelihood contours are not elliptical
3. Sample uniformly within this expanded ellipse, rejecting draws until one lies in the L(x) > L_j region (a toy sketch of these steps is given below the list).
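For concreteness, here is a minimal Python sketch of how I read steps 1-3. The covariance-based enclosing ellipsoid, the toy Gaussian likelihood and the choice of L_j in the usage lines are my own illustrative assumptions, not the authors' implementation:
[code]
import numpy as np

def enclosing_ellipsoid(points):
    """Fit an ellipsoid (centre, shape matrix) that just encloses all the points."""
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    inv = np.linalg.inv(cov)
    # Scale the covariance ellipsoid so the farthest point sits on its boundary.
    d2max = max(float((p - mean) @ inv @ (p - mean)) for p in points)
    return mean, cov * d2max

def sample_in_ellipsoid(mean, shape, f, rng):
    """Draw uniformly from the ellipsoid after enlarging it by a linear factor f."""
    n = len(mean)
    u = rng.standard_normal(n)
    u *= rng.random() ** (1.0 / n) / np.linalg.norm(u)   # uniform point in the unit n-ball
    return mean + f * np.linalg.cholesky(shape) @ u      # map the ball onto the ellipsoid

def replace_live_point(live, log_L, log_Lj, f, rng):
    """Steps 1-3: reject draws from the enlarged ellipsoid until L(x) > L_j."""
    mean, shape = enclosing_ellipsoid(live)
    n_calls = 0
    while True:
        x = sample_in_ellipsoid(mean, shape, f, rng)
        n_calls += 1
        if log_L(x) > log_Lj:
            return x, n_calls

# Toy usage: Gaussian log-likelihood in n = 5 dimensions, enlargement f = 1.5.
rng = np.random.default_rng(0)
log_L = lambda x: -0.5 * float(x @ x)
live = rng.standard_normal((100, 5))
log_Lj = min(log_L(p) for p in live)    # likelihood of the point being replaced
x_new, n_calls = replace_live_point(live, log_L, log_Lj, 1.5, rng)
print(n_calls, "likelihood calls to find the replacement point")
[/code]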
This is in contrast to Skilling's original method, where he suggests using MCMC for this sampling step. My worry about the method used here as a general method is the following. In n dimensions the volume of a sphere of radius r scales as r^n, so if the sphere is expanded by a factor f, the fraction of the new volume contained in the target volume is (1/f)^n. The acceptance rate is therefore approximately 1/f^n, and the number of likelihood calculations needed to obtain each new sample is ~ f^n. For n=5, f=1.5 this gives an acceptance rate of ~0.13, similar to the 20% they report; for n=7, f=1.8 it gives ~0.016. Since f^n grows exponentially with dimension, this sampling method would seem to become exponentially bad in high dimensions, so perhaps for n >~ 8 using MCMC would be a much better idea (though the ellipse may be fine for the n <= 8 cases considered in this paper).
Also, the expanded elliptical region is of course not guaranteed to enclose all points in the L(x) > L_j region, so the method is not formally correct, and can only be tested in each case by expanding by a much larger factor (very slow!) and checking for stability of the result.
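For what it's worth, the (1/f)^n scaling argument above is easy to tabulate; this is just the back-of-envelope estimate, nothing taken from the paper (the extra n values are my own extrapolation):
[code]
# Rough acceptance rate (1/f)^n and cost f^n per new sample, from the argument above.
for n, f in [(5, 1.5), (7, 1.8), (10, 1.5), (15, 1.5)]:
    accept = f ** (-n)
    print(f"n={n:2d}  f={f:.1f}  acceptance ~ {accept:.4f}  likelihood calls per sample ~ {1/accept:.0f}")
[/code]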
B. Is it useful to use different 'priors'? e.g.
[tex]
\int d\theta L(\theta)P(\theta) =\int d\theta \frac{L(\theta) P(\theta)}{P'(\theta)} P'(\theta),
[/tex]
so one could perhaps apply this method sampling from P'(\theta) rather than the original P(\theta) [and using the appropriately modified likelihood]. For example, P' could be some approximation to L^{\beta} that can be quickly normalized. (A toy numerical check of the reweighting identity is sketched below.)
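A quick 1-D sanity check of the identity: sampling from a different prior P' and reweighting by P/P' leaves the evidence unchanged. The Gaussian forms chosen for P, P' and L here are purely illustrative assumptions:
[code]
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)

# Original prior P, modified 'prior' P', and likelihood L (all toy Gaussians).
P  = lambda t: gauss_pdf(t, 0.0, 1.0)
Pp = lambda t: gauss_pdf(t, 0.5, 0.5)
L  = lambda t: gauss_pdf(t, 1.0, 0.3)

theta_P  = rng.normal(0.0, 1.0, 200_000)   # samples from P
theta_Pp = rng.normal(0.5, 0.5, 200_000)   # samples from P'

Z_direct     = np.mean(L(theta_P))                                # E_P[L]
Z_reweighted = np.mean(L(theta_Pp) * P(theta_Pp) / Pp(theta_Pp))  # E_P'[L P / P']
print(Z_direct, Z_reweighted)   # both estimate the same evidence integral
[/code]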