[astro-ph/0411737] Cosmological Parameter Constraints as Derived from the Wilkinson Microwave Anisotropy Probe Data via Gibbs Sampling and the Blackwell-Rao Estimator

Authors:  M. Chu, H.K. Eriksen, L. Knox, K. M. Gorski, J. B. Jewell, D. L. Larson, I. J. O'Dwyer, B. D. Wandelt
Abstract:  We study the Blackwell-Rao (BR) estimator of the probability distribution of the angular power spectrum, P(C_l|d), by applying it to samples of full-sky no-noise CMB maps generated via Gibbs sampling. We find the estimator, given a set of samples, to be very fast and also highly accurate, as determined by tests with simulated data. We also find that the number of samples required for convergence of the BR estimate rises rapidly with increasing l, at least at low l. Our existing sample chains are only long enough to achieve convergence at l less than about 40. In comparison with P(C_l|d) as reported by the WMAP team we find significant differences at these low l values. These differences lead to up to approximately 0.5 sigma shifts in the estimates of parameters in a 7-parameter Lambda CDM model with non-zero dn_s/dln(k). Fixing dn_s/dln(k)= 0 makes these shifts much less significant.
Unlike existing analytic approximations, the BR estimator can be straightforwardly extended for the case of power spectra from correlated fields, such as temperature and polarization. We discuss challenges to extending the procedure to higher l and provide some solutions.

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Antony Lewis » November 30 2004

This paper investigates how many samples of full-sky signal [tex]C_l[/tex] estimators, drawn from the posterior given the data, you need in order to estimate parameters accurately. I find the result - that you need a (possibly) exponentially growing number of samples as you increase l_max - rather surprising (and disappointing for what seems like an otherwise very neat method).

As I understand it, for a signal-dominated full sky the full-sky estimators consistent with the (small) noise should all be nearly identical. Hence the number of samples needed is essentially one, independent of l_max, because all the samples are the same. Somehow, as the sky coverage is decreased and it becomes less signal dominated, this [tex]\ell[/tex]-independent behaviour has to change into the exponentially bad behaviour found in this paper. Could someone explain where this change is coming from?

I'm also not clear whether 'samples' in this paper corresponds to independent samples or to Gibbs samples. If the latter, is the exponential behaviour just the fact that more iterations are required to obtain independent samples?

If one sampled directly from the parameters as a new Gibbs step, as suggested in astro-ph/0310080, rather than going via the Blackwell-Rao estimator, is the l-scaling problem avoided?

Hans Kristian Eriksen
Posts: 60
Joined: September 25 2004
Affiliation: ITA, University of Oslo
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Hans Kristian Eriksen » November 30 2004

Antony Lewis wrote:This paper investigates how many samples of full-sky signal [tex]C_l[/tex] estimators, drawn from the posterior given the data, you need in order to estimate parameters accurately. I find the result - that you need a (possibly) exponentially growing number of samples as you increase l_max - rather surprising (and disappointing for what seems like an otherwise very neat method).
Yes, the exponential function was rather surprising the first time we saw it, especially since, as you point out, for full sky coverage and no noise, you only need one single sample. The problem is that once you start to get coupling between different sigma_l's (see Fig. 6b of Eriksen et al. 2004, ApJS, 155, 227, astro-ph/0407028) because of the sky cut, the total volume of the space you need to cover becomes huge with increasing l_max. Now, *we* know that the functional form of the likelihood is really very well behaved, and that perhaps only two- or three-point correlations are important, but the *sample-based estimator* doesn't know that, unless we explicitly tell it so (effectively set most of the likelihood space to zero).

In the very first run we made, we applied the BR estimator to the full range between l=2 and 400, and found that one single sigma-sample completely dominated the estimator. And that single sample was the one that by chance happened to have the smallest RMS power spectrum scatter around some imagined model. Now, it is true that this particular sample would have a very good chi^2, but on the other hand, it should also be assigned a very small volume in the likelihood space. But because we have so few samples, it isn't -- the chi^2 is still very good, but the volume (1/N_samples) is quite large. Therefore this single sample drives the fit.
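
To make the "one sample dominates" effect concrete, here is a minimal numerical sketch of the BR average (a toy illustration, not our analysis code), assuming the standard inverse-gamma kernel for P(C_l|sigma_l) with a flat prior on C_l; constants that are the same for all samples are dropped:

[code]
import numpy as np

def log_br_estimate(cl_theory, sigma_samples, ells):
    """Log of the Blackwell-Rao estimate of P(C_l|d) at a trial spectrum,
    up to an overall constant.  Each Gibbs sample of sigma_l contributes one
    inverse-gamma kernel P(C_l|sigma_l); the estimate is their average.

    cl_theory     : (n_ell,) trial spectrum C_l
    sigma_samples : (n_samp, n_ell) sigma_l samples from the Gibbs chain
    ells          : (n_ell,) the multipoles l
    """
    nu = 2.0 * ells + 1.0
    # per-sample log kernel, summed over l (sample-independent constants dropped):
    # (2l-1)/2 ln(sigma_l) - (2l+1)/2 ln(C_l) - (2l+1) sigma_l / (2 C_l)
    logk = np.sum(0.5 * (nu - 2.0) * np.log(sigma_samples)
                  - 0.5 * nu * np.log(cl_theory)
                  - 0.5 * nu * sigma_samples / cl_theory, axis=1)
    # numerically stable average of exp(logk); with few samples the one with
    # the largest logk (the 'best chi^2' sample) dominates this mean
    m = logk.max()
    return m + np.log(np.mean(np.exp(logk - m)))
[/code]

With only a handful of samples, one logk value typically sits far above the rest, so the mean is effectively that single kernel -- which is exactly the behaviour described above.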

So, the question is how to solve this problem. One approach is of course to generate an infinite number of samples, but that is a little difficult in practice [:-)]. Another is to impose further constraints on the estimator. One option is to split the multipole range into bands, apply the estimator to each band separately, and multiply the partial likelihoods together. Unfortunately, this loses the inter-band correlations. Another option is to pull out the univariate distributions from the BR estimator and apply approximate analytical corrections to handle the correlations. There are lots of options for how to do this, and we are still working on this issue.
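
A correspondingly minimal sketch of the banding option, reusing log_br_estimate from the code above (the band width of delta_l=30 is just illustrative); by construction the inter-band correlations are dropped:

[code]
def log_br_banded(cl_theory, sigma_samples, ells, band_width=30):
    """Apply the BR average within each l-band separately and add up the log
    partial likelihoods; correlations between different bands are ignored."""
    total = 0.0
    for start in range(0, len(ells), band_width):
        band = slice(start, start + band_width)
        total += log_br_estimate(cl_theory[band], sigma_samples[:, band], ells[band])
    return total
[/code]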

I still think the method is quite neat, although we currently have problems treating the full range exactly. Being able to treat bands of delta_l=30 exactly with only 1000 samples isn't that bad, I think. And this is only the very first application of the method -- we will definitely refine it a lot in the future.
Antony Lewis wrote: I'm also not clear whether 'samples' in this paper corresponds to independent samples or to Gibbs samples. If the latter, is the exponential behaviour just the fact that more iterations are required to obtain independent samples?
For S/N >> 1, consecutive samples are effectively independent, so the distinction doesn't matter. You can have a look at Figure 8 of Eriksen et al. 2004, ApJS, 155, 227, to see the correlation length as a function of S/N. We also verified this in the current paper.
Antony Lewis wrote: If one sampled directly from the parameters as a new Gibbs step as suggested in astro-ph/0310080 rather than going via the Blackwell-Rao Estimator is the l-scaling problem avoided?
We haven't looked into this yet, and my intuition has failed before, so I don't think I want to make strong predictions quite yet. However, my gut feeling is that sampling over parameters should help quite a lot, since the dimensionality of the parameter space is so much smaller -- but don't take my word for it; there may be issues I haven't realized yet. So often there are... :-)

Benjamin Wandelt
Posts: 12
Joined: September 24 2004
Affiliation: IAP, UPMC, Sorbonne Universites, and Illinois
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Benjamin Wandelt » November 30 2004

I have a few additional comments and (hopefully) clarifications.

The samples Hans Kristian referred to in his post were Gibbs samples. So there is the effect of correlations at high l. But this effect does not scale exponentially. There is a separate issue (which Hans Kristian describes in his post) which has to do with the width of the Blackwell-Rao (BR) kernels being narrower than the support of the likelihood in the low S/N regime. This separate difficulty arises because creating a smooth, converging approximation to the likelihood P(C_l|d) given a set of samples is more ambitious than computing *moments* of it. The volume to be covered by the kernels scales up exponentially with l in units of the kernel widths. This is the source of the problem we describe.

I should emphasize that our paper presents a first exploration of the practical issues of implementing the BR estimator for the full likelihood P(C_l|d). We describe various approaches that soften the problem (such as broadening the kernels and banding in l space). In addition, we are currently studying other approaches.

The sampling over parameters I suggested in astro-ph/0310080 does get rid of these problems, since it does not require evaluating P(C_l|d), only P(C_l|s) at each Gibbs iteration. It has other advantages as well - such as optimality within a given parameterized theoretical framework.

One of the neat aspects of the Blackwell-Rao estimator one would like to have the option of retaining is the convenience of separating power spectrum estimation (which requires a great deal of information about the experiment) from parameter estimation. Having the option of decoupling the two makes it possible to communicate a very good approximation of P(C_l|d) in a very compact form. This allows broad interpretation within the context of different theories. That is why I think it is worth pursuing both avenues - direct sampling from parameters, as well as developing a compact representation of P(C_l|d) that can be plugged in to parameter estimation codes.

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Antony Lewis » December 01 2004

Thanks both.

I understand the problem with low S/N at high l, but as I understand it, the low-l, high-S/N (outside the cut) behaviour is something different.

As I understand Fig 6 of astro-ph/0407028, the correlations between [tex]independent[/tex] samples should only be a few percent (as expected: since WMAP uses 85% of the sky, l-correlations are small). Correlations between Gibbs samples are much larger; however, the correlation length doesn't get exponentially worse, so this is not the problem.

What I then have trouble understanding is the following: given that the [tex]\sigma_l[/tex] of independent samples are only correlated at the few percent level (and hence in most cases the correlations can be safely neglected), how does the number of independent [tex]\sigma_l[/tex] samples required come to grow exponentially with [tex]\ell[/tex]?

Given that the S/N on each [tex]C_l[/tex] is high (so the variations of [tex]\sigma_l[/tex] due to different samplings of the true sky in the cut are small compared to the width of the [tex]P(C_l|\sigma_l)[/tex] distribution), I would expect only an order-unity number of samples to be required, more or less independently of [tex]\ell[/tex] (as long as it remains signal dominated). I.e. I'd expect the scaling with [tex]\ell[/tex] to come almost entirely from the scaling of the correlation length with [tex]\ell[/tex]. Clearly this isn't what you see though!

Benjamin Wandelt
Posts: 12
Joined: September 24 2004
Affiliation: IAP, UPMC, Sorbonne Universites, and Illinois
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Benjamin Wandelt » December 01 2004

Antony Lewis wrote:As I understand Fig 6 of astro-ph/0407028, the correlations between [tex]independent[/tex] samples should only be a few percent (as expected: since WMAP uses 85% of the sky, l-correlations are small). Correlations between Gibbs samples are much larger; however, the correlation length doesn't get exponentially worse, so this is not the problem.
Correct.
Antony Lewis wrote: What I then have trouble understanding is the following: given that the [tex]\sigma_l[/tex] of independent samples are only correlated at the few percent level (and hence in most cases the correlations can be safely neglected), how does the number of independent [tex]\sigma_l[/tex] samples required come to grow exponentially with [tex]\ell[/tex]?
In the low [tex]\ell[/tex], high S/N regime, the width of [tex]P(C_\ell|d)[/tex] is [tex]\sim\sqrt{f}[/tex] larger at each l than the BR kernels [tex]P(C_\ell|\sigma_\ell)[/tex], where [tex]f[/tex] is the inverse of the sky fraction. Therefore, the volume under the likelihood that needs to be covered by the BR kernels grows exponentially (scaling presumably as [tex]f^{\ell/2}[/tex], though I didn't check this against the simulation results) with increasing upper limit [tex]\ell[/tex], in units of the volume of the BR kernels.
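
To spell the counting out (a rough argument, under the assumption that each multipole independently contributes a factor of order [tex]\sqrt{f}[/tex] to the ratio of widths):

[tex]\frac{V[P(C_\ell|d)]}{V[P(C_\ell|\sigma_\ell)]}\sim\prod_{\ell=2}^{\ell_{max}}\sqrt{f}=f^{(\ell_{max}-1)/2},[/tex]

which is the [tex]f^{\ell/2}[/tex]-type scaling quoted above.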

The requirement of covering all of the available volume under the function [tex]P(C_\ell|d)[/tex] is a consequence of our goal of providing a smooth representation of this function in terms of the BR kernels, each one of which has a much smaller volume for large [tex]\ell_{max}[/tex].

The two approaches for dealing with this mentioned in the paper attack the problem from two directions: banding reduces the exponent, and kernel broadening reduces the base.

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Antony Lewis » December 03 2004

Thanks, I think I get it now. I persuaded myself there is indeed a problem by trying it. In case it helps anyone else, here's what I did:

The Gibbs sample set will cover each [tex]\ell[/tex] distribution well, giving the right answer. For example, for the [tex]C_l[/tex] I get quite good agreement between simulation input and the mean of the Gibbs samples:

[image: simulation input C_l compared with the mean of the Gibbs samples]

However this is not sufficient, because the BR estimator depends on the average over samples of the exponential of the sum of the log likelihoods from each [tex]\ell[/tex]. The distribution of [tex]\chi^2[/tex]s of some fiducial model from the Gibbs samples looks something like this:

[image: histogram of the fiducial-model chi^2 values over the Gibbs samples]

so the average of [tex]\exp(-\chi^2/2)[/tex] is completely dominated by the low-[tex]\chi^2[/tex] tail (just as Hans Kristian said).

This big spread comes about because there are [tex]\ell_{max}[/tex] random-like contributions to the chi-squared for each Gibbs sample. These add up in a random-walk-like fashion, giving a big spread of total chi-squared values for large [tex]\ell_{max}[/tex], even if the [tex]C_l[/tex] are totally uncorrelated.
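
In case it's useful, here is a toy numerical illustration of this random-walk spread (it replaces the per-[tex]\ell[/tex] log-kernel scatter by independent unit chi-squared draws, so it is only meant to show the effect, not to reproduce the Gibbs chain):

[code]
import numpy as np

rng = np.random.default_rng(0)
lmax, n_samples = 200, 5000

# each sample's total chi^2 is a sum of lmax order-unity random contributions,
# so the totals spread in a random-walk-like fashion, ~ sqrt(2*lmax)
chi2 = rng.chisquare(df=1, size=(n_samples, lmax)).sum(axis=1)

# BR-style weights exp(-chi^2/2), rescaled by the best sample for stability
w = np.exp(-0.5 * (chi2 - chi2.min()))
w /= w.sum()

print("spread of total chi^2      :", chi2.std())          # ~ 20 for lmax = 200
print("weight carried by best one :", w.max())              # close to 1
print("effective number of samples:", 1.0 / np.sum(w**2))   # close to 1
[/code]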

[To generate lots of samples quickly I assumed an azimuthally symmetric cut, with zero noise outside the cut. Working in harmonic space everything is then block diagonal and easy to compute (the coupling matrices are given in e.g. astro-ph/0008083, or, including polarization, in astro-ph/0106536). I made the approximation that each new full-sky temperature realisation was given by

[tex]T = P_1 T_{obs} + (1-P_1) T_{new}[/tex]

where [tex]T_{new}[/tex] is a new signal realisation with the current [tex]C_l[/tex] and [tex]P_1[/tex] is a projection matrix into the basis of modes well supported on the observed sky. This approximation is exact as [tex]\ell_{max} \rightarrow \infty[/tex] because the eigenvalues of the coupling matrix tend to one or zero. ]

The question then is: if the [tex]C_l[/tex] are in fact only weakly correlated, does one really need to average the total probability from all the [tex]\ell[/tex] - wouldn't averaging each [tex]\ell[/tex] (or each small range of [tex]\ell[/tex]) separately be sufficient? This is the banding option considered in the paper.

Hans Kristian Eriksen
Posts: 60
Joined: September 25 2004
Affiliation: ITA, University of Oslo
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Hans Kristian Eriksen » December 03 2004

Antony Lewis wrote:Thanks, I think I get it now. I persuaded myself there is indeed a problem by trying it.
We seem to think very similarly -- I never trust a result or method until I've actually tested it myself :-)

Thanks for spending time on doing this test!

Hans Kristian

Anze Slosar
Posts: 183
Joined: September 24 2004
Affiliation: Brookhaven National Laboratory
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Anze Slosar » December 03 2004

However this is not sufficient, because the BR estimator depends on the average over samples of the exponential of the sum of the log likelihoods from each [tex]\ell[/tex]. The distribution of [tex]\chi^2[/tex]s of some fiducial model from the Gibbs samples looks something like this:
Sorry, I don't get this plot, could you explain the axes?

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Antony Lewis » December 03 2004

The second plot is a histogram of the [tex]\chi^2[/tex] values of a standard [tex]\Lambda CDM[/tex] model obtained from all of the 4500-odd Gibbs samples in the chain. The [tex]\chi^2[/tex] are for [tex]\ell \le 200[/tex], though I simulated up to [tex]\ell=600[/tex] to make the approximation better. I.e. the x-axis is [tex]\chi^2[/tex] and the y-axis is counts.

So almost all the Gibbs samples have a [tex]\chi^2 > 200[/tex], but the contribution to the average probability, [tex]\exp(-\chi^2/2)[/tex], comes almost entirely from the lowest few samples around [tex]\chi^2 \sim 185[/tex].

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Antony Lewis » December 09 2004

Some related questions/ideas.

It seems that there are currently two methods documented (in astro-ph/0310080) for performing parameter estimation using Gibbs chains:
  1. Gibbs sample [tex]\sigma_l[/tex] and [tex]C_l[/tex] using a flat prior on the [tex]C_l[/tex]. Later use these samples to compute the Blackwell-Rao estimator of [tex]P(C_l | d)[/tex], then use this in a new MCMC run to sample from a set of parameters [tex]\theta[/tex].
  2. Using Gibbs to sample from [tex]\sigma_l[/tex] and [tex]\theta[/tex] directly.
Question: are these methods identical? (i.e. are the sets of [tex]\theta[/tex] you end up with drawn from identical distributions?) This isn't totally obvious to me, as in (1) you are sampling the [tex]\sigma_l[/tex] with a flat (uncorrelated) prior on the [tex]C_l[/tex], whereas in (2) the prior on the theory [tex]C_l[/tex] is highly correlated and determined by the prior on the much smaller set of parameters [tex]\theta[/tex].

Putting it another way: what happens if you use the BR estimator with the samples of [tex]\sigma_l[/tex] drawn using (2) (to run a new MCMC chain to sample from the same set of parameters)? This is clearly(?) not the same as (1). I think basically I'm confused about what [tex]P(\sigma_l|d)[/tex] in Eq. 19 of astro-ph/0310080 means, since it seems to depend implicitly on the measure over [tex]C_l[/tex].

Moving on, there seems to be a third protocol worth considering:

3. Use Gibbs sampling as in (1). Then later for each sample of [tex]\sigma_l[/tex] draw a sample of [tex]\theta[/tex] (e.g. using MCMC).

So you end up with a set of samples:

Sample 1 from [tex]P(\theta | \sigma^{(1)}_l)[/tex]
Sample 2 from [tex]P(\theta| \sigma^{(2)}_l)[/tex]
etc.

Protocol (3) does not have the convergence problems of (1), because sampling from [tex]P(\theta | \sigma^{(i)}_l)[/tex] is independent of the absolute value of the probability (so the fact that there is a large spread of [tex]\chi^2[/tex] at high [tex]\ell_{max}[/tex] is irrelevant). It can also be applied many times with different parameter sets without re-doing the Gibbs run (the advantage of (1) over (2)). [By [tex]\sigma_l[/tex] I mean the set of all [tex]\ell[/tex], so it does know about the full correlation structure.] It does require running a lot of parameter-estimation MCMC chains, but that should scale rather well and can be trivially parallelized.
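
For concreteness, here is a minimal sketch of (3) as I'm imagining it (cl_of_theta, theta0 and the proposal step are placeholders; in practice cl_of_theta would be CAMB/CMBFAST and the chain a proper MCMC run). Note that only probability ratios within a single [tex]\sigma_l[/tex] sample are used, which is why the absolute normalization never enters:

[code]
import numpy as np

def log_kernel(cl_theory, sigma_l, ells):
    """ln P(C_l(theta) | sigma_l): the inverse-gamma kernel summed over l,
    up to a theta-independent constant."""
    nu = 2.0 * ells + 1.0
    return np.sum(-0.5 * nu * (sigma_l / cl_theory + np.log(cl_theory)))

def draw_theta(sigma_l, ells, cl_of_theta, theta0, step, n_steps, rng):
    """Short Metropolis chain targeting P(theta | sigma_l); returns its final state."""
    theta = np.array(theta0, dtype=float)
    logp = log_kernel(cl_of_theta(theta), sigma_l, ells)
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.size)
        logp_prop = log_kernel(cl_of_theta(prop), sigma_l, ells)
        if np.log(rng.uniform()) < logp_prop - logp:   # Metropolis accept/reject
            theta, logp = prop, logp_prop
    return theta

# protocol (3): one parameter draw per Gibbs sample of sigma_l, e.g.
# samples = [draw_theta(s, ells, cl_of_theta, theta0, step, 500, rng)
#            for s in sigma_samples]
[/code]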

I think methods (3) and (2) are not equivalent because they differ by the priors used on the theory [tex]C_l[/tex] used to generate the [tex]\sigma_l[/tex] when running the Gibbs chain.

However for nearly full-sky I would expect the samples of [tex]\sigma_l[/tex] to be relatively insensitive to the prior on the theory [tex]C_l[/tex] (simply because the prior only affects the signal over the cut, which is a small fraction of the sky). So (3) may be useful?

One could also imagine using priors on the theory [tex]C_l[/tex] which encapsulate the known smoothness properties of almost all parameter sets we might be interested in, without actually going all the way to cosmological parameters. For example, one could assume the function is smooth and generated from a reduced set of parameters given by the [tex]C_l[/tex] at every 50th [tex]\ell[/tex] (CMBFAST/CAMB only compute every 50th [tex]\ell[/tex] - the cubic spline from that gives a very accurate result). Another (in specific cases probably equivalent) possibility would be to use a Gaussian process prior on the set of [tex]C_l[/tex] functions, i.e. imposing a prior that nearby [tex]\ell[/tex] are very correlated - which is true for most physical theory [tex]C_l[/tex] but not true for the [tex]\sigma_l[/tex] from actual realisations. I would have thought such a prior on the theory [tex]C_l[/tex] would give samples of [tex]\sigma_l[/tex] very similar to those obtained by going directly to physical cosmological parameters as in (2).

Note that even if you impose a prior on the smoothness of the theory [tex]C_l[/tex], you still end up with a set of non-smooth samples of [tex]\sigma_l[/tex] which tell you about the [tex]\ell[/tex]-distribution of power on the actual sky (so anomalies etc. would probably still show up, even if somewhat influenced by the prior).

Benjamin Wandelt
Posts: 12
Joined: September 24 2004
Affiliation: IAP, UPMC, Sorbonne Universites, and Illinois
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Benjamin Wandelt » December 09 2004

Antony Lewis wrote: Question: are these methods identical? (i.e. are the set of [tex]\theta[/tex] you end up with drawn from identical distributions). This isn't totally obvious to me, as in (1) you are sampling from [tex]\sigma_l[/tex] with a flat (uncorrelated) prior on the [tex]C_l[/tex], wheras in (2) the prior on the theory [tex]C_l[/tex] is highly correlated and determined from the prior on the much smaller set of paramters [tex]\theta[/tex].
These are not identical, since the posterior density in (2) includes a prior on the range of possible C_l. You can visualize this prior by plotting the population of theory C_l you get from repeatedly running a Boltzmann code such as CAMB or CMBFAST and drawing the cosmological parameters using flat priors. Non-flat priors on the parameters are of course possible.

By contrast in (1), the prior is defined by a flat prior on C_l, without any reference to the assumed underlying theory for the power spectrum.

Assuming a prior which incorporates physics (ie only C_l spectra are allowed that are plausible outputs of a Boltzmann code) adds information that propagates through to the samples of the signal maps, foregrounds, and whatever else is included in the model.

I agree that a smoothness prior will have some of the same features as a "CMBFAST/CAMB prior".

Your option (3) is an interesting idea. It corresponds to marginalizing over the cosmological parameters [tex]\theta[/tex] used in (2). If we denote by [tex]M[/tex] the model assumption implicit in choosing which parameters are present in [tex]\theta[/tex], then this corresponds to constructing a density over new parameters [tex]\theta'[/tex] as

[tex]P(\theta'|M)=\int P(\theta'|\sigma_\ell,M)P(\sigma_\ell|M)\,d\sigma_\ell[/tex].

The question is then: "To what extent did the choice of parameters during Gibbs sampling affect [tex]P(\theta'|M)[/tex]?" Or put a different way: "How does [tex]P(\theta'|M)[/tex] differ from [tex]P(\theta')[/tex], our original target?" As you say, I expect there will be subtle ways in which the prior influences the [tex]\sigma_\ell[/tex] samples.

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Antony Lewis » December 10 2004

Benjamin Wandelt wrote:Your option (3) is an interesting idea. It corresponds to marginalizing over the cosmological parameters [tex]\theta[/tex] used in (2). If we denote by [tex]M[/tex] the model assumption implicit in choosing which parameters are present in [tex]\theta[/tex], then this corresponds to constructing a density over new parameters [tex]\theta'[/tex] as

[tex]P(\theta'|M)=\int P(\theta'|\sigma_\ell,M)P(\sigma_\ell|M)\,d\sigma_\ell[/tex].
I'm not sure I understand this (or maybe you have a different interpretation of (3)). I was thinking of (3) being a Gibbs chain run exactly like (1) - with no cosmological physics entering at all (to start with anyway - maybe smoothness prior or something later on). The cosmological parameters only enter later when generating the set of [tex]\theta[/tex] from the Gibbs samples using MCMC. Or are you suggesting something different?

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Antony Lewis » December 14 2004

We discussed this offline at Fermilab - Ben had just misunderstood, and thinks (3) should work OK (though it is not globally optimal like (2)).

I tried out using (3) versus (1) quickly. Here's a plot using the same toy model as before with [tex]\ell_{max} =20[/tex]:

[image: constraints on ns and q at l_max=20: BR method (1) in black, method (3) in red, full-sky input in blue]

ns and q are the same pseudo-parameters as used in the paper. Black is the constraint using BR, i.e. method (1); red is the constraint using method (3). Blue is the result obtained using the full-sky input to the simulation. The contours are from MCMC runs - it's much quicker to get the red curve than the less well-converged black one. [Note my input simulation happened to have little power on large scales, hence the input model is only just at 2 sigma, with higher ns being favoured.]

At [tex]\ell_{max} = 200[/tex] it still works very quickly, e.g:

[image: constraints on ns and q at l_max=200 from method (3) with two independent sets of 500 Gibbs samples (black, red) and from the perfect full sky (blue)]

Black and red are the results from applying (3) with the first 500 and the second 500 samples from the Gibbs chain (probably far fewer Gibbs samples than this are really needed). Blue is the result with a perfect full sky. (The BR result is not calculable at this [tex]\ell_{max}[/tex].)

When there is noise there is a potentially useful trick that can improve (3) further. Running MCMC chains with samples of [tex]\sigma_l[/tex] from a noise-free full sky gives very tight parameter constraints and may be numerically difficult. If you instead pull out the isotropic, uncorrelated part of the noise and include it with the 'theory' as [tex]\sigma_l+N_l[/tex], you can sample from [tex]P(\theta|\sigma_l+N_l)[/tex], which is also known analytically. This distribution is much broader, and hence the parameter-estimate samples should converge much more quickly. As I understand it, Ben agrees with this, though he thinks there may be problems when the real noise is significantly correlated.
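
Concretely, the only change relative to the kernel in my earlier sketch would be something like the following (one reading of the trick, with N_l an assumed isotropic noise spectrum):

[code]
import numpy as np

def log_kernel_with_noise(cl_theory, nl, sigma_plus_nl, ells):
    """Same inverse-gamma form as before, but with the isotropic noise
    spectrum N_l folded in with the theory: the 'data' is sigma_l + N_l and
    the 'theory' is C_l(theta) + N_l, giving a much broader distribution in
    theta than conditioning on the noise-free sigma_l."""
    nu = 2.0 * ells + 1.0
    total = cl_theory + nl
    return np.sum(-0.5 * nu * (sigma_plus_nl / total + np.log(total)))
[/code]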

There is an apparent paradox: method (1) appears to depend almost entirely on the best-fit Gibbs sample, while method (3) uses all the Gibbs samples equally. What's going on? As I recall Ben's answer: in method (1) the best-fit sample does of course depend on the model being evaluated. And method (1) only converges when each Gibbs sample has contributed non-negligibly, so the 'paradox' is really just that an exponentially large number of samples is needed for (1) - at which point the exponentially small contributions to the BR estimate can add up to something non-negligible.

Benjamin Wandelt
Posts: 12
Joined: September 24 2004
Affiliation: IAP, UPMC, Sorbonne Universites, and Illinois
Contact:

[astro-ph/0411737] Cosmological Parameter Constraints as Der

Post by Benjamin Wandelt » December 15 2004

Some comments on these results and the offline discussions Antony and I had:

The exponential blow-up of the number of samples required for the BR estimator is a consequence of working in the high-dimensional space of power spectra.

For parameter estimation, we don't actually need to be able to evaluate the power spectrum likelihood [tex]P(C_\ell|data)[/tex]. We want to generate samples from [tex]P(\theta|data)[/tex], where [tex]\theta[/tex] is a set of cosmological parameters.

Now, by virtue of the equation for the BR estimator of [tex]P(C_\ell|data)[/tex] in Wandelt et al., one can derive both a BR estimator and a sampling scheme for [tex]P(\theta|d)[/tex]. There is a subtlety in the definition of [tex]P(\theta|data)[/tex], but I'll take the definition that is commonly used in cosmological parameter analyses (e.g. using CosmoMC), [tex]P(\theta|data)=P(C_\ell(\theta)|data)[/tex] - see the post scriptum below. Since [tex]\theta[/tex] only has dimension of order 10, this BR estimator will converge much faster.
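
Written out explicitly under that definition, the estimator is just the BR average evaluated along the parameterized family of spectra (with [tex]\sigma_\ell^{(i)}[/tex] the Gibbs samples):

[tex]P(\theta|d) \approx \frac{1}{N}\sum_{i=1}^{N} P(C_\ell(\theta)|\sigma_\ell^{(i)})[/tex].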

Owing to the degeneracies between cosmological parameters, the map from parameters to power spectra is many-to-one. In addition, for a given set of parameters [tex]\theta[/tex] the data allows many sets of [tex]\sigma_\ell[/tex] that fit this [tex]\theta[/tex] equally well. I imagine, for example, that the contribution of a single BR kernel would be invariant under the exchange of two [tex]\sigma_\ell[/tex] at different [tex]\ell[/tex], after adjusting the values to compensate for the different widths of the kernel at different [tex]\ell[/tex]. This non-uniqueness provides additional smoothing in the parameter space, lowering the required number of samples.

While it may be interesting to actually evaluate [tex]P(\theta|data)[/tex] in some circumstances, for many analyses a sampled representation is more convenient (hence the success of Markov Chain methods in cosmological parameter estimation). This is accomplished (in principle) by running a Markov chain for each signal sample, using the inverse gamma distribution [tex]P(C_\ell(\theta)|\sigma_\ell)[/tex]. As soon as each chain has decorrelated from its initial condition, we can return a few uncorrelated samples from each chain. This sampling procedure can be justified using the BR estimator.

This technique maintains the advantage of a compact representation of the cosmological information in terms of a set of [tex]\sigma_\ell[/tex] taken from Gibbs chains.

There are other practical advantages to this scheme. Each Markov chain explores the BR kernel, which is conditioned on perfectly reconstructed signal maps. Therefore the individual Markov chains can be optimized for cosmic-variance-limited measurements, independent of the scatter of the actual data. One ultimately obtains the correct uncertainty in the parameter estimates by merging the parameter samples from all the short Markov chains. These chains can be run in parallel.

Beyond the application to WMAP data, Antony mentions potential practical problems for future applications of this technique. For Planck, the fractional cosmic variance at [tex]\ell\sim 2000[/tex] is a few times [tex]10^{-4}[/tex]. This comes close to challenging the accuracy of cosmological Boltzmann codes and may lead to an incorrect exploration of a given BR kernel. But since the spread of parameter samples between different kernels will be much larger than the spread within a single kernel, these errors should have a small effect. Other issues affect CMB parameter estimation in ways that worry me more, such as secondary anisotropies, lensing, point sources, etc.

In summary, all this suggests that it is possible to generate a sample from [tex]P(\theta|data)[/tex] given a power spectrum Gibbs chain. This ought to tame the exponential blow-up in the required number of samples presented in our paper, which initiated this discussion. It remains to be studied how many independent Gibbs samples are required for the sample to converge in the 10-D parameter space. However, that is a less stringent requirement than asking for a smooth representation of [tex]P(C_\ell|data)[/tex] in the much larger space of power spectra. Our current efforts in developing methods to decorrelate samples in the Gibbs chain should help satisfy this requirement with a manageable number of samples.

PS: The subtleties I mentioned above relate to how [tex] P(C_\ell(\theta)|data)[/tex] compares to the [tex]P(\theta|data)[/tex] that would be generated by running a Gibbs chain that includes sampling the parameters. This is a separate issue that deserves study and numerical experiments.

Antony Lewis
Posts: 1941
Joined: September 23 2004
Affiliation: University of Sussex
Contact:

Re: [astro-ph/0411737] Cosmological Parameter Constraints as

Post by Antony Lewis » December 17 2004

Here is a very simple toy problem which helps me think about what's going on in the various methods:

Say we have four quadrupole [tex]a_{2m}[/tex] known, those with [tex]|m|>0[/tex]. We want to infer the value [tex]q \equiv C_2 \equiv \langle|a_{2m}|^2\rangle[/tex] of the [tex]a_{2m}[/tex] variance. We know nothing about [tex]a_{20}[/tex].

We know the correct answer for the posterior distribution of q is an inverse Gamma distribution (with four degrees of freedom).

The Gibbs method of sampling is to sample iteratively from

1. [tex]a_{20}[/tex] from a Gaussian with variance q
2. q from inverse Gamma with 5 d.o.f. max at [tex]\sigma_2[/tex]

where [tex]\sigma_2 = \frac{1}{5}\sum_m |a_{2m}|^2[/tex].

Applying the BR estimator to the samples of [tex]\sigma_2[/tex] (method(1)) gives the right answer (4 d.o.f. inverse Gamma), as does method (3).
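
For anyone who wants to play with it, here is a minimal sketch of this toy problem (the seed, grid and sample count are arbitrary; constants independent of q are dropped and the curves renormalised numerically on the grid):

[code]
import numpy as np

rng = np.random.default_rng(1)
q_true = 1.0
a_known = rng.normal(0.0, np.sqrt(q_true), size=4)   # the four known |m|>0 modes
S = np.sum(a_known**2)

n_samp = 5000
q, sigma2 = 1.0, np.zeros(n_samp)
for i in range(n_samp):
    a20 = rng.normal(0.0, np.sqrt(q))        # step 1: a_20 | q  ~  N(0, q)
    sigma2[i] = (S + a20**2) / 5.0           # sigma_2 = (1/5) sum_m |a_2m|^2
    # step 2: q | sigma_2 ~ inverse gamma, density ~ q^(-5/2) exp(-5 sigma_2/(2q))
    q = 2.5 * sigma2[i] / rng.gamma(shape=1.5)

# Blackwell-Rao estimate of P(q|data): average of the per-sample kernels
qs = np.linspace(0.05, 6.0, 400)
kern = sigma2[:, None]**1.5 * qs**(-2.5) * np.exp(-2.5 * sigma2[:, None] / qs)
br = kern.mean(axis=0)
br /= br.sum() * (qs[1] - qs[0])

# exact posterior: 4-d.o.f. inverse gamma, ~ q^(-2) exp(-S/(2q))
exact = qs**(-2.0) * np.exp(-0.5 * S / qs)
exact /= exact.sum() * (qs[1] - qs[0])
# br and exact should now agree closely
[/code]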

Now say that in addition we have the prior information that q > 1. Since the BR estimator directly returns P(q|data), this will still be correct (up to normalization) within the prior range.

Method (3) will not however produce the correct answer: to be able to re-sample from the [tex]\sigma_2[/tex] we have to first apply the prior by removing all samples with [tex]q\le 1[/tex]. Re-sampling from the full original chain includes samples with too little power in the faked [tex]a_{20}[/tex].

Method (2) can be applied with the prior information, sampling from q restricted to the prior range. This will give samples of q from the correct distribution. Applying the BR estimate is now more complicated however: we need to compute the correctly normalized [tex]P(q|\sigma_2)[/tex]. This would require accounting for the prior range restriction in the normalization, making it much messier. (Of course we don't need to apply BR as we already have the samples of q that we want.)

Conclusions: Method (3) is only an approximation that will work if the original prior on the [tex]C_l[/tex] is close to the true prior (hence the idea of using a smoothness-prior, etc). As can be seen in my plot in the earlier post, method (3) gives generally wider error bars due to the significant contribution of low multipoles with very large power (in the BR method these contribute negligibly).

In this toy problem, methods (1) and (2) are in fact identical (the parameters are sampled from the same distributions). I'm still unclear under what more general circumstances this remains true. The more complicated 2D toy parameter estimation problem seems to give (slightly) different results for (1) and (2), for reasons I'm unclear about.

PS. There was actually a bug in the earlier BR plots (screwed up the [tex]\log(\sigma_l)[/tex] term), but the correct ones look similar I think.

Re: including some of the noise with the signal when running parameter chains:
Beyond the application to WMAP data, Antony mentions potential practical problems for future applications of this technique. For Planck, the fractional cosmic variance at [tex]\ell\sim 2000[/tex] is a few times [tex]10^{-4}[/tex]. This comes close to challenging the accuracy of cosmological Boltzmann codes and may lead to an incorrect exploration of a given BR kernel. But since the spread of parameter samples between different kernels will be much larger than the spread within a single kernel, these errors should have a small effect.
In the simple case of a full-sky observation with isotropic noise, including the noise with the theory has to be much better for parameter estimation: you only need one Gibbs sample, the full-sky [tex]\sigma_l + N_l[/tex]. One MCMC chain (with many samples) for the parameters from this gives you everything else. So I'd have thought it would also be better in general (allowing you to take many samples from [tex]P(\theta|\sigma_l+N_l)[/tex] for each Gibbs sample).
