
CosmoCoffee

[0708.2989]
Fast optimal CMB power spectrum estimation with Hamiltonian sampling

Authors:  J. F. Taylor, M. A. J. Ashdown, M. P. Hobson 
Abstract:  We present a method for fast optimal estimation of the temperature angular
power spectrum from observations of the cosmic microwave background. We employ
a Hamiltonian Monte Carlo (HMC) sampler to draw samples directly from the
posterior probability of the power spectrum given a set of observations. We
demonstrate the method on simulations of the 3year WMAP data and find that it
performs well, even in regions of low signal to noise where related Gibbs
sampling approaches suffer from inefficiencies. Analysis of a WMAP sized data
set is possible in a day on a highend desktop computer. HMC imposes few
conditions on the distribution to be sampled and provides us with an extremely
flexible approach. 

[PDF]
[PS] [BibTex] [Bookmark]

View previous topic :: View next topic 
Author 
Message 
Hans Kristian Eriksen
Joined: 25 Sep 2004 Posts: 58 Affiliation: ITA, University of Oslo

Posted: August 23 2007 


This interesting method paper appeared on astroph today, and if everything checks out, then this certainly seems to be a promising technique! However, I'm a little puzzled by some of the results, at first glance. It may well be just plotting issues, but I thought I'd ask anyway.
First, the blue points and error bars in Figure 1 are claimed to show the posterior mean and 68% error bars computed from the Hamiltonian sampling. What puzzles me is the shape of the error bars in the region between l=10 and 100 – at l >~ 20, the true posterior distributions are quite Gaussian, and one should definitely not see any huge asymmetry between the upper and lower error bars at l = 50 or 100. And, yet, the error bars shown here are tremendously asymmetric, with the lower error bar almost coinciding with the mean for most points.
Second, Figure 3 shows the estimated posterior distributions for three selected multipoles, namely 15, 50 and 200. Here there is a rather odd feature, namely that the peak of the distributions appear to coincide with the *theoretical* spectrum used for generating the simulation, rather than the actual spectrum of this particular realization. This is also in poor agreement with theoretical expectations.
So, while the method certainly appears promising from a computational efficiency point of view, I'd also be interested in seeing a direct comparison with an analytical lowresolution case. As presented here, the results do appear somewhat unnatural. But again, it could be just plotting issues. 

Back to top 


Mark Ashdown
Joined: 27 Sep 2004 Posts: 3 Affiliation: University of Cambridge

Posted: August 23 2007 


Thanks for your comments.
The asymmetry in the error bars was caused by an error in plotting the colour version for astroph. The positions of the confidence limits were correct, but the positions of the means were not! In the the correct version the error bars are much more symmetrical.
Regarding the other issue about the position of the peak of the posterior. I think it is just an unlucky realisation (and choice of l values), but we are checking.... 

Back to top 


Hans Kristian Eriksen
Joined: 25 Sep 2004 Posts: 58 Affiliation: ITA, University of Oslo

Posted: August 23 2007 


Thanks for your reply! Also, now I've read the paper more carefully, I have a few additional questions I hope you can help me out with.
First, it seems to me that this is a fairly standard Metropolis algorithm, really, and the "Hamiltonian" part of this is really just a proposal density that provides fairly uncorrelated samples, right? If so, is it obvious that the proposal density T(x,x') is symmetric (T(x,x') = T(x',x)), so that one does not have to apply the more general MetropolisHastings rule in place of equation 14 in this paper? It's probably quite simple, but I don't see it right now..
Then a few other comments and questions:
1) I'm not sure I'd say that a GelmanRubin statistic of 1.2 represent "excellent convergence". I've seen too many weird results at this level of convergence, and for fast applications like CosmoMC, I usually prefer 1.01. Also for the Gibbs sampler, I require convergence at the few percent level before believing the results.
2) Is there any reason why the burnin length is so long compared to the sampling length (3000 vs. 4050) samples? Also, what was the initial spectrum used to produce the first sky sample by the standard Gibbs step?
3) What is the typical correlation length of the resulting Markov chain? Are two consequtive samples uncorrelated? If not, how does it depend on l (or S/N)?
Will be interesting to see how this works out – I'd be very interested in adding an additional Gibbs sampling step at intermediate and high l's (socalled subspace sampling; Eriksen et al. 2007), to improve the low S/N efficiency of the Gibbs sampler. So if this method does in fact produce samples from the exact posterior (ie., the proposal density is indeed symmetric), then that's a very interesting option. 

Back to top 


Mark Ashdown
Joined: 27 Sep 2004 Posts: 3 Affiliation: University of Cambridge

Posted: August 24 2007 


Hans Kristian wrote:  First, it seems to me that this is a fairly standard Metropolis algorithm, really, and the "Hamiltonian" part of this is really just a proposal density that provides fairly uncorrelated samples, right? If so, is it obvious that the proposal density T(x,x') is symmetric (T(x,x') = T(x',x)), so that one does not have to apply the more general MetropolisHastings rule in place of equation 14 in this paper? It's probably quite simple, but I don't see it right now.. 
It isn't obvious. The proof that the Hamiltonian sampler preserves detailed balance using the Metropolis acceptance rule is quite complicated. You have to consider the proposal density in the (x, p) phase space, T[(x, p), (x', p')], and then integrate over the nuisance parameters p and p'. See the original Duane et al. paper that introduced the method (Physics Letters B, vol. 195, issue 2, p. 216 (1987)).
Hans Kristian wrote:  1) I'm not sure I'd say that a GelmanRubin statistic of 1.2 represent "excellent convergence". I've seen too many weird results at this level of convergence, and for fast applications like CosmoMC, I usually prefer 1.01. Also for the Gibbs sampler, I require convergence at the few percent level before believing the results. 
Perhaps that was a bit too glib. The Hanson statistic we use is more pessimistic than the GelmanRubin statistic, but nevertheless it would be better to have it closer to 1. We compute R for each individual parameter. With the best part of a million parameters, getting it to within 1% of 1 would for every parameter be very good indeed!
Hans Kristian wrote:  2) Is there any reason why the burnin length is so long compared to the sampling length (3000 vs. 4050) samples? Also, what was the initial spectrum used to produce the first sky sample by the standard Gibbs step? 
During the burnin phase, we update the masses in the Hamiltonian. At the start, the masses are set using the approximate variances of the parameters given by equations 28 and 29. During burnin they are updated using the variance of the alreadycomputed samples. Also, for this proofofconcept paper, we were simply being cautious about when to end the burnin phase.
The initial power spectrum used to produce the signal Gibbs sample was the input theoretical spectrum. Maybe that's cheating, but for the temperature spectrum, we are going to know it pretty well anyway.
Hans Kristian wrote:  3) What is the typical correlation length of the resulting Markov chain? Are two consequtive samples uncorrelated? If not, how does it depend on l (or S/N)? 
We haven't computed the correlation length of the samples yet. The limiting factor is the masses. The better they are, the further we can go in phase space with every leapfrog step and the less likely the samples are to be correlated. 

Back to top 


Hans Kristian Eriksen
Joined: 25 Sep 2004 Posts: 58 Affiliation: ITA, University of Oslo

Posted: August 25 2007 


Mark Ashdown wrote: 
Perhaps that was a bit too glib. The Hanson statistic we use is more pessimistic than the GelmanRubin statistic, but nevertheless it would be better to have it closer to 1. We compute R for each individual parameter. With the best part of a million parameters, getting it to within 1% of 1 would for every parameter be very good indeed!

Well, it depends on how you look at it. The thing is that the power spectrum coefficients C_{l} are all fairly uncorrelated, and the fact that there are a thousand of them doesn't really matter in terms of sampling efficiency. Because they are "uncorrelated", it's not much harder to get good convergence for a thousand than it is for one. So typically, even with ~10,000 samples, I usually find that R is much, much less than 1.1 for the Gibbs sampler for temperature spectra. For the lowl polarization paper, I produced 100,000 samples fairly quickly, and this give R less than 1.01.
Mark Ashdown wrote: 
Hans Kristian wrote:  3) What is the typical correlation length of the resulting Markov chain? Are two consequtive samples uncorrelated? If not, how does it depend on l (or S/N)? 
We haven't computed the correlation length of the samples yet. The limiting factor is the masses. The better they are, the further we can go in phase space with every leapfrog step and the less likely the samples are to be correlated. 
OK. This will be a very interesting quantity to know. The thing is of course that the Gibbs sampler produces uncorrelated samples in the high S/N regime, and the question of which approach is faster depends not on the number of spherical harmonics transform to produce one single sample, but rather to produce two *uncorrelated* samples. That's the problem with typical MCMC approaches – they produce strongly correlated samples, and one has to thin the chain quite heavily. 

Back to top 


Antony Lewis
Joined: 23 Sep 2004 Posts: 1351 Affiliation: University of Sussex

Posted: August 28 2007 


This is certainly an interesting idea, but I agree with Hans Kristian's last point.
Related to this, using a convergence criterion based only on each C_{l} does not at all guarantee that the method is working. A sky cut will correlate the a_{lm} (and hence C_{l}), and to claim the method is better than using (say) PseudoC_{l}, it needs to get the full correlation structure right (or at least parameterize the full posterior C_{l} distribution more accurately than you can from PseudoC_{l}). If you have a set of variables that are correlated, but use a proposal distribution (mass matrix) that does not know about this correlation, it is quite easy to end up with samples that look OK in terms of an R statistic for each parameter independently, but where other directions in parameter space are very poorly explored. One way a problem might show up would be to work out the convergence statistic applied in the eigendirections of the estimated posterior covariance. [In the parameter estimation context. the single Rvalue returned by CosmoMC's GetDist is the value for the worst of these directions]
I would expect that for the method to work well you'd need to use an accurate mass matrix for the a_{lm}/pixels. To get this you need to have already solved the problem, or at least have a good approximation. Given we know the a_{lm} are highly correlated (linear combinations corresponding to modes localized in the cut are essentially unconstrained), is there any reason to expect a diagonal mass matrix to work reasonably? For example to explore the range of a mediuml C_{l}, all the pixels values in the cut area may need to go coherently up and down; I would have thought this is a very rare proposed move (or requires very many Hamiltonian steps to explore) if the pixels are taken to be uncorrelated.
Also, I presume the BlackwellRao curves for l=200 are calculated in an uncorrelated approximation? (I think the full result for l_{max} > 200 requires a very large number of independent samples to converge)
Incidentally, Radford Neal's review, which discusses the Hamiltonian method along with explanation of expected scaling with dimension, is available online at
http://www.cs.toronto.edu/~radford/review.abstract.html 

Back to top 


Hans Kristian Eriksen
Joined: 25 Sep 2004 Posts: 58 Affiliation: ITA, University of Oslo

Posted: September 10 2007 


Thanks! I now think I understand why and how the Hamiltonian sampler works, at least in terms of the general algorithm.
I was just wondering – would it be possible to post a couple of trace plots of C_{l} vs. iteration count for, say, l=2, 200 and 700 here? I'd be very interested in having a look at these..
Also, did the authors figure out whether they were just unlucky with the multipoles shown in Figure 3 in the paper, all being centered on the theoretical spectrum, or if there actually remains some influcence of the input spectrum?
My main interest in these issues stems from the fact that we need a better sampler in the low signaltonoise regime, where the exact Gibbs sampler doesn't perform terribly well. I still doubt that any "general MCMC type" algorithm can compete in the high S/N regime, because of the strong correlations among a_{lm}'s, but in the low S/N regime, it's quite possible, since the χ^{2} is quite insensitive to these modes and the joint posterior is "widened out". So, as mentioned above, I'd be very interested in seeing the C_{l} trace plots as a function of l. It might even be a good idea to include a couple of these in the paper...
PS! Mark mentioned above that the mean positions in Figure 1 were plotted incorrectly. I now also noted that there is a quite surprising difference in the error bars between the Hamiltonian sampler and the MASTER results at intermediate l's, such as l ~ 200 and most clearly at l~275. Are these error bars plotted correctly? 

Back to top 


Jeremy Taylor
Joined: 24 Jul 2006 Posts: 2 Affiliation: Cavendish Astrophysics, University of Cambridge

Posted: September 25 2007 


Mark is rather busy at the minute so I'll try and answer some of your questions instead.
Firstly correlation lengths. We've examined the correlation of the samples used in the paper and found that the correlation length tends to be rather long in comparison to the Gibbs sampler. However there is no strong dependence on the signal to noise ratio so even at high multipoles we can do a reasonable job of sampling the distribution. It is also worth pointing out that the correlation length for Hamiltonian Monte Carlo depends strongly upon how well one has computed masses for each parameter and also upon the number of leapfrog steps taken during each trajectory. We expect to be able to do much better than the current results with some more work on how best to tune the masses. Below is a plot I have made of the correlation coefficient for a number of the C_{l}s and hopefully you can see that indeed the correlation length is only weakly dependent on .
There was also a request to see some plots of C_{l} through the chain so here they are for a few s. These are a subset of the samples we got spanning the end of burn in and beginning of sampling. The horizontal line shows the value of the realization spectrum. Again the rather long correlation length is obvious but the samples do appear to explore the distribution over a wide range in .
I'd also like to suggest that the use of a diagonal mass matrix is less damaging to performance than it might seem. Hamiltonian Monte Carlo tends to be extremely good at exploring correlated distributions, David MacKay's book http://www.inference.phy.cam.ac.uk/mackay/itila/ has some simple illustrations of this, and this is often given as a primary reason for using the method.
We've had another look at the distributions, the coincidences seemed worrying but we think this is okay. To do some further checks I created some low resolution simulations (Nside = 32), on which I could run a simple Gibbs sampler in a reasonable amount of time, and compared the samples gained from our Hamiltonian Monte Carlo sampler and a Gibbs sampler. The final results were extremely similar but here are a few plots that might be more convincing. In each plot the grey filled histogram is created from the HMC samples and the unfilled blue histogram is from the Gibbs sampler. The red shows the input spectrum and the green shows the value for the realization. Firstly a few for the case of no cut where we see the maximum of the distribution matches the value of the realization spectrum closely as well as good agreement between the samplers. I've shown l = 7, 30, 50 (l=7 because I can't get my plotting program to do log scales and the lowest multipoles are difficult to read on a linear scale)
Then with a Kp2 cut we again get excellent agreement between the Gibbs and Hamiltonian samplers but now the maximum and realization values are not expected to coincide. Here are three examples.
Thanks for all the comments. 

Back to top 


Hans Kristian Eriksen
Joined: 25 Sep 2004 Posts: 58 Affiliation: ITA, University of Oslo

Posted: September 25 2007 


Jeremy Taylor wrote:  Mark is rather busy at the minute so I'll try and answer some of your questions instead.

Thanks! These plots are very illustrative, indeed!
Jeremy Taylor wrote: 
Below is a plot I have made of the correlation coefficient for a number of the C_{l}s and hopefully you can see that indeed the correlation length is only weakly dependent on .

Hmm... Well, yes, at least things do seem to be less predictable here than for the Gibbs sampler, in the sense that the l=200 function has a shorter correlation length than that for l=50. On the other hand, the l=500 has the longest, which is qualitatively similar behaviour to the Gibbs sampler. However, the main point is that all these modes are relatively high signaltonoise, except for the l=500 one which has S/N ~ 0.5, I would guess. So it's difficult to compare, really, with Figure 8 in our paper. What we are really interested in is the range between l=400 and l=1000, or so – this is where the Gibbs sampler becomes inefficient. At l<400, the Gibbs sampler is pretty much perfect in terms of sampling efficiency and mixing, and so I don't worry much about these modes. It's where S/N < 1 things are troublesome..
So right now this doesn't seem quite as promising as I hoped it would be for the low S/N regime – my initial hope was that the correlation length for the Hamiltonian sampler would become *shorter* in the low S/N regime than in the high S/N regime, because the intrinsic a_{lm} correlations would be hidden by noise. Then the perfect hybrid would be the Gibbs sampler at S/N > 1 and the Hamiltonian sampler at S/N < 1. That doesn't seem to be the case from this plot, I think.. Hmm...
PS! What is happening at the end (right hand side) of the iteration plots, when the acceptance rates seem to fall dramatically..? 

Back to top 


Jeremy Taylor
Joined: 24 Jul 2006 Posts: 2 Affiliation: Cavendish Astrophysics, University of Cambridge

Posted: September 26 2007 


The reason for the sudden drop in acceptance rate is that at some point there we've switched between the burn in phase and the sampling phase. During burn in we can tune the step size to maintain a high acceptance whereas in the sampling phase we can't and it appears that , slightly unfortunately, the burn in ended at a point where the was a rather larger than suitable step size so acceptance rate is then low. This is straightforward to fix.
I think it would be possible for the HMC to be geared towards just the low signal to noise regions and get much better performance. I am unsure however how you could best combine a Gibbs and HMC sampler, I guess you would have to make each full sample consist of a low ell Gibbs sample and a high ell HMC sample, the problem being that for each sample the HMC would have to burn in. 

Back to top 


Hans Kristian Eriksen
Joined: 25 Sep 2004 Posts: 58 Affiliation: ITA, University of Oslo

Posted: September 26 2007 


Jeremy Taylor wrote: 
I think it would be possible for the HMC to be geared towards just the low signal to noise regions and get much better performance. I am unsure however how you could best combine a Gibbs and HMC sampler, I guess you would have to make each full sample consist of a low ell Gibbs sample and a high ell HMC sample, the problem being that for each sample the HMC would have to burn in. 
If you can easily tune the HMC to handle the low S/N efficiently, then that would certainly be very useful!
The algorithm for producing full joint samples is very straightforward, since it's essentially just a variation of the Gibbs sampler: First you draw one fullrange sample using the old Gibbs algorithm. This results in perfect mixing at S/N < 1, but poor mixing at S/N > 1. Then you *condition* on the S/N > 1 C_{l} and a_{lm}'s, and draw a highl joint (C_{l}, a_{lm}) sample with HMC. This is a valid and exact Gibbs sampling algorithm, because you condition on the low l's, and draw exactly from the resulting conditional distribution with HMC. That's a very good thing with the Gibbs sampler – it doesn't matter what algorithm you use to sample from the various conditional distributions. The HMC is in principle just as good as the exact sampling algorithms we use in our approach. It's all a question of convergence properties, nothing else. (Assuming, of course, that the basic algorithms do in fact sample from the correct distributions in the first place.. :)) 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

