## [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Authors: Kendrick M. Smith, Oliver Zahn, Olivier Dore

Abstract: Gravitational lensing of the cosmic microwave background (CMB), a long-standing prediction of the standard cosmological model, is ultimately expected to be an important source of cosmological information, but a first detection has not been achieved to date. We report a 3.4 sigma detection, obtained by applying quadratic estimator techniques to all-sky maps from the Wilkinson Microwave Anisotropy Probe (WMAP) satellite and correlating the result with radio galaxy counts from the NRAO VLA Sky Survey (NVSS). We present our methodology, including a detailed discussion of potential contaminants. Our error estimates include systematic uncertainties from density gradients in NVSS, beam effects in WMAP, Galactic microwave foregrounds, resolved and unresolved CMB point sources, and the thermal Sunyaev-Zeldovich effect.

Discussion related to specific recent arXiv papers
Simon DeDeo
Posts: 44
Joined: October 26 2004
Affiliation: Santa Fe Institute
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Let's all talk about this amazing result! I am just reading the paper now so do not have anything profound to say yet, but would be curious to hear the thoughts of the community.

Uros Seljak
Posts: 4
Joined: November 09 2006
Affiliation: University of Zurich
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

This is a very nice work. We (with C. Hirata, S. Ho and N. Padmanabhan) have done an independent analysis of the same data and get about a 2.5 sigma signal from NVSS. We also have about a 2 sigma signal from the SDSS LRG+QSO sample. We plan to submit it in a few weeks. For NVSS the difference is probably in us taking a more conservative approach on a number of potential systematics, like dealing with point sources etc. This is not to say that their higher S/N is not believable, as I think they have done a good job, just that it is often a judgement call where to make a cut in the data to remove systematics which can induce a spurious signal, and it is difficult to prove one way or the other when one has a low S/N, since the evidence for a systematic effect itself may not be statistically significant. One can also make the opposite statement: by adopting a stricter cut to remove a systematic effect, one has removed more of the actual signal and lowered the S/N, so there is no way to prove one over the other with any statistical significance. There were similar issues with various ISW detection claims in the past. This is why a detection should not be defined by 2 or 3 sigma evidence; a 5 sigma definition would be better, and particle physicists often laugh at the 2 or 3 sigma definitions of a detection that somehow seem to have been adopted in cosmology.

Anze Slosar
Posts: 183
Joined: September 24 2004
Affiliation: Brookhaven National Laboratory
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Kudos to Kendrick and company - a proper paper without any short-cuts taken!

I have yet to find time to read these papers carefully, but wouldn't this technique be able to shed light on the claimed correlation between the NVSS number density and the cold spot, as advocated in arXiv:0704.0908 (debated here: http://cosmocoffee.info/viewtopic.php?t=864)? If there really is this huge underdensity, then I would guess that one should get a considerable amount of lensing from it, so you might be able to detect it from either the ordinary 3-point or this 2+1-point correlation function?

Oliver Zahn
Posts: 9
Joined: March 10 2005
Affiliation: Berkeley Center for Cosmological Physics

### [arXiv:0705.3980]

thanks for the comments...

Perhaps particle physicists always made their judgment calls (as to how deep to dig into the systematics) too early, so they learned that they'd better wait for 5 sigma to come along before claiming a detection ;-)

A question to Uros et al. about your NVSS-only analysis: is the difference between the 2.5 and our 3.4 sigma (both stat+syst) mainly the result of additional systematic errors, e.g., as you suggest, due to different treatment of point sources, or is it the different NVSS sample construction? Or could much of it already be present in the statistical significance, due to differences between the lensing estimators, or other small differences between our analysis pipelines?

Oliver Zahn
Posts: 9
Joined: March 10 2005
Affiliation: Berkeley Center for Cosmological Physics

### Re: [arXiv:0705.3980]

Anze Slosar wrote:wouldn't this technique be able to shed light on the claimed correlation between the NVSS number density and the cold spot, as advocated in arXiv:0704.0908 (debated here: http://cosmocoffee.info/viewtopic.php?t=864)? If there really is this huge underdensity, then I would guess that one should get a considerable amount of lensing from it, so you might be able to detect it from either the ordinary 3-point or this 2+1-point correlation function?
Since that cold spot takes up <1% of the sky, and forecasts based on WMAP and NVSS specs (without assuming any anomalies) agree well with the significance found in the data, it looks like there's not enough of a statistical handle in WMAP to see lensing due to this void. It would be interesting if smaller scale CMB experiments were pointed that way though.

Kendrick Smith
Posts: 3
Joined: August 03 2006
Affiliation: University of Chicago

### Re: [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Uros Seljak wrote:This is a very nice work. We (with C. Hirata, S. Ho and N. Padmanabhan) have done an independent analysis of the same data and get about a 2.5 sigma signal from NVSS. We also have about a 2 sigma signal from the SDSS LRG+QSO sample. We plan to submit it in a few weeks. For NVSS the difference is probably in us taking a more conservative approach on a number of potential systematics, like dealing with point sources etc. This is not to say that their higher S/N is not believable, as I think they have done a good job, just that it is often a judgement call where to make a cut in the data to remove systematics which can induce a spurious signal, and it is difficult to prove one way or the other when one has a low S/N, since the evidence for a systematic effect itself may not be statistically significant. One can also make the opposite statement: by adopting a stricter cut to remove a systematic effect, one has removed more of the actual signal and lowered the S/N, so there is no way to prove one over the other with any statistical significance. There were similar issues with various ISW detection claims in the past. This is why a detection should not be defined by 2 or 3 sigma evidence; a 5 sigma definition would be better, and particle physicists often laugh at the 2 or 3 sigma definitions of a detection that somehow seem to have been adopted in cosmology.
First, I completely agree with the particle physicists' sentiment that the definition of a detection should be something like 5 sigma. (I have heard the rule of thumb that "half of all 3 sigma results turn out to be wrong" and am sure this comes from experience!)

One effect which I think is underappreciated is the danger of modifying the analysis after "looking at the answer". I have frequently seen cases where a key result shifts by ~0.5 sigma after a 5% change in data selection. If one is trying to decide whether to include a particular data cut, and already knows that including it shifts the result toward the hoped-for answer, it's hard to be unbiased. I imagine one can unintentionally wander quite far from the true answer this way...

In our lensing paper, we did freeze our data selection before looking at the result for the first time, so I think we're OK as far as this source of bias is concerned. (The only exception was marginalizing m=0 modes in NVSS, which we added after looking at the galaxy auto spectrum.) I suspect this mainly reflects all the hard work put in by the primary WMAP and NVSS analysis teams to produce clean data, so that all our null tests and checks worked on the first try. The real solution is to do a "blind analysis" as commonly performed in particle physics; it would be great to see this level of rigor make its way to cosmology!

From email exchanges with Chris, I suspect the main difference in our analyses which might account for the 0.9 sigma difference is that we include Q-band WMAP data in addition to V+W. Given this 50% change in data volume on the CMB side and other smaller differences in the pipelines (such as estimator construction, NVSS source mask...) I'm not surprised that the two final results can differ by 0.9 sigma.

We did worry a lot about NVSS systematics: we felt that WMAP is well-understood enough to be simulated reliably (a possible exception is point sources, which is why we spent so much time working on point source systematic errors!) but we were worried about systematics hidden in NVSS not captured by our simulations.

One strong check is that we get the same statistical errors by cross-correlating WMAP simulations to the NVSS data, treating the latter as a "black box". This shows that the result only depends on the WMAP simulations, and in particular is robust to any NVSS systematic unless there is some reason why the systematic would be in the WMAP data but not WMAP simulations. For this reason (and all the other systematics checks including point source estimates), I don't see how we could be strongly affected by systematics related to the NVSS sample construction.

My feeling is that systematics for lens reconstruction are difficult and largely unexplored, but much tamer in cross-correlation, which is what made this measurement robust (as far as we can tell...)
There is probably a lot of work needed to get a well-controlled "internal" measurement of CMB lensing from experiments like Planck, ACT, SPT, which are not that far away!

Uros Seljak
Posts: 4
Joined: November 09 2006
Affiliation: University of Zurich
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

I did not mean to suggest that the difference is caused by systematics, as I have no reason to believe that in the case of this analysis. My point was simply that if a difference between two results is in itself not statistically significant then there is no need to worry about it, as it could be caused by any number of effects, like the choice of data, weighting, method for error calculation, where to cut the data to remove systematics etc. In this case the difference in \chi^2 is about 5, which is hardly very significant, and this could be caused by any of the mentioned effects (and Kendrick's suggestion as to what may be the primary source of the difference may well be correct). The point is that a difference between 2.5 and 3.4 sigma seems large and worrisome (one is a detection and the other is not?), while a difference between 5 and 5.5 sigma seems much less worrisome, even though in terms of a change in \chi^2 they are the same, which is why adopting a 5 sigma standard makes this much less of an issue.
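The \chi^2 arithmetic behind this point is easy to verify: for a single amplitude parameter, an n sigma detection corresponds to \Delta\chi^2 = n^2 relative to the null, so the gap between 2.5 and 3.4 sigma involves essentially the same \chi^2 change as the gap between 5.0 and 5.5 sigma. A minimal sketch (my own illustration, not taken from either paper):

```python
# For a one-parameter amplitude fit, an n-sigma detection corresponds to
# Delta-chi^2 = n^2 relative to the zero-signal model.
def delta_chi2(sigma_hi, sigma_lo):
    """Change in chi^2 between two detection significances."""
    return sigma_hi**2 - sigma_lo**2

# 2.5 vs 3.4 sigma looks alarming (detection vs. no detection?) ...
print(delta_chi2(3.4, 2.5))   # ~5.3
# ... but 5.0 vs 5.5 sigma, an essentially identical chi^2 shift,
# would worry nobody.
print(delta_chi2(5.5, 5.0))   # 5.25
```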

Kendrick's example of modifying the analysis a posteriori is a good one, and things are made worse if we adopt a 2 sigma standard for detection (and there are cosmology papers out there doing this, you know who you are :-), since it is the change in \chi^2 that is relevant: a change of 4 may only be 0.5 sigma if you start from 4 sigma, but is in fact 2 sigma if you start from 0. This can be done intentionally and a posteriori, i.e., one can imagine someone doing multiple analyses using different data subsets or weights or error estimation methods etc. and choosing the one with the highest S/N. But it may also happen unintentionally and seemingly not a posteriori: one group does their analysis and finds nothing, but another group with different cuts etc. finds a 2 sigma deviation and claims a detection, without realizing that if twenty groups each analyzed a random subset of the data with no signal, one among them would likely find a 2 sigma detection.

Finally, there is the issue of the absence of any stakes in this: for many of these first detections the signal is well predicted and nobody really doubts it should be there, so one is not betting one's reputation by making a detection claim, which I think is part of the reason for adopting the low sigma standard of detection. I.e., if one is a Bayesian, the detection has already been made before the data were analyzed, and the posterior is a small perturbation to the prior. If one were claiming something crazy and unexpected, then one would think twice before making a detection claim out of a 2 sigma effect in the data (although this still does not stop some from doing so). But again, these are just general comments and none of this is directed at this paper, which is a very nice piece of work.
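The "twenty groups" scenario above can be quantified under idealized assumptions (independent analyses of signal-free data, Gaussian statistics, a two-sided 2 sigma criterion); this is a sketch of my own, not a calculation from any of the papers discussed:

```python
import math

def p_two_sided(n_sigma):
    """Two-sided Gaussian tail probability of an |n|-sigma fluctuation."""
    return math.erfc(n_sigma / math.sqrt(2.0))

def p_any_exceeds(n_sigma, n_groups):
    """Chance that at least one of n_groups independent, signal-free
    analyses fluctuates past n_sigma purely by accident."""
    return 1.0 - (1.0 - p_two_sided(n_sigma))**n_groups

# A single 2-sigma result is a ~4.6% accident ...
print(round(p_two_sided(2.0), 3))        # 0.046
# ... but among twenty independent null analyses, the odds that
# *someone* reports a 2-sigma "detection" are about 60%.
print(round(p_any_exceeds(2.0, 20), 2))  # 0.61
```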

Thomas Dent
Posts: 26
Joined: November 14 2006
Affiliation: ITP Heidelberg
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

How much better do 'detections' or measurements need to be before alternative theories of gravity (which lead to different relations between lensing and other measures of density) start to be tested? Quite a lot, I'd guess.

TD

Written under protest at cosmocoffee's exclusionary policy

Roberto Trotta
Posts: 18
Joined: September 27 2004
Affiliation: Imperial College London
Contact:

### Re: [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Kendrick Smith wrote:
First, I completely agree with the particle physicists' sentiment that the definition of a detection should be something like 5 sigma. (I have heard the rule of thumb that "half of all 3 sigma results turn out to be wrong" and am sure this comes from experience!)
Actually, it turns out this comes from statistics! You can show that typically 50% of all 95% (i.e., 2 sigma) results are wrong, in that they wrongly reject a true null hypothesis (this means the effect they claim is there is actually absent). This is well known in the stats community, and it goes under the name of the "p-value fallacy" (see http://uk.arxiv.org/abs/0706.3014 and references therein).

The issue boils down to a misinterpretation of what an n-sigma result actually means: namely, if you have an $\alpha$% result, then $1-\alpha$ is the probability that you get data as extreme or more extreme than what you have observed, assuming the null hypothesis is true. You are not allowed (unless you use Bayes theorem) to interpret this as the probability of the null hypothesis to be true, which is actually what you are usually after. The corollary is that in the long run the fraction of $\alpha$% results that are wrong is much larger than $1-\alpha$. In fact, it can be an order of magnitude larger! This has nothing to do with Gaussianity or with sample size.
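A toy Monte Carlo illustrates the fallacy described above. The numbers here (even prior odds of a real effect, a weak 0.5 sigma true effect, a 2 sigma detection threshold) are illustrative assumptions of mine, not taken from any of the papers discussed:

```python
import random

def false_detection_fraction(n_trials=200_000, prior_null=0.5,
                             true_effect=0.5, threshold=2.0, seed=1):
    """Monte Carlo: among unit-noise measurements that cross the
    `threshold`-sigma bar, what fraction had no real effect behind
    them?  `true_effect` is the (weak) signal, in sigma, present
    when an effect actually exists."""
    rng = random.Random(seed)
    detections = false_alarms = 0
    for _ in range(n_trials):
        null_is_true = rng.random() < prior_null
        mean = 0.0 if null_is_true else true_effect
        z = rng.gauss(mean, 1.0)
        if abs(z) > threshold:       # a "2-sigma detection"
            detections += 1
            false_alarms += null_is_true
    return false_alarms / detections

# Far more than the naive ~5% of the 2-sigma "detections" are
# accidents of the noise (roughly 40% with these assumptions).
print(false_detection_fraction())
```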
Uros Seljak wrote: This is why a detection should not be defined with 2 or 3 sigma evidence, 5 sigma definition would be better and particle physicists often laugh at 2 or 3 sigma definitions of a detection that somehow seem to have been adopted in cosmology.
I agree with Uros here (although I've been one of the culprits jumping on a 2-sigma-ish result in the past!). If you use Bayesian calibrated p-values, then you find that the threshold for "strong evidence at best" (i.e., the maximum evidence that the alternative is ever going to get under any prior among a suitably defined family) corresponds to 3.6 sigma. So this could be a good reason to require stronger standards of evidence in cosmology.
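The 3.6 sigma / "strong evidence at best" correspondence can be checked numerically, assuming (as I believe is the case here) that the calibration meant is the Sellke-Bayarri-Berger bound on the Bayes factor, B <= -1/(e p ln p):

```python
import math

def max_bayes_factor(p):
    """Upper bound on the Bayes factor in favour of the alternative,
    B <= -1/(e * p * ln p), valid for p < 1/e
    (the Sellke-Bayarri-Berger calibration of p-values)."""
    assert p < 1.0 / math.e
    return -1.0 / (math.e * p * math.log(p))

def sigma_to_p(n_sigma):
    """Two-sided Gaussian tail probability for an n-sigma result."""
    return math.erfc(n_sigma / math.sqrt(2.0))

# A 3.6-sigma result corresponds to p ~ 3e-4 ...
print(sigma_to_p(3.6))           # ~3.2e-4
# ... and p = 3e-4 can never support odds better than ~150:1,
# i.e. "strong" evidence at best.
print(max_bayes_factor(3e-4))    # ~151
```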

Thomas Dent
Posts: 26
Joined: November 14 2006
Affiliation: ITP Heidelberg
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Actually, in particle physics 5 sigma is required for 'discovery'. 3 sigma is commonly called 'evidence'. Anything above 2 sigma may be talked about in some terms ... but probably doesn't deserve to be.

If you want to invent a new standard for 'detection' (which should mean the same as 'discovery' but sounds a bit less dramatic) in cosmology then you are free to do so... just so long as people have a clear idea what it means. 3.6 sigma seems a bit weak to me if you want a 'detection' to be indisputable.

From Gordon and Trotta:
The comparison between p-values and Bayes factors suggests a threshold of ℘ = 3 × 10⁻⁴ or σ = 3.6 is needed if the odds of 150:1 ("strong" support at best) are to be obtained. However, as systematic effects in the data analysis typically lead to a shift of order a sigma, it follows that the "particle physics threshold" of 5σ may be required in order to obtain odds of at best 150:1.
I'm not sure what to make of this. Particle physics doesn't have a 'threshold' of 5 sigma; theorists commonly find 3 sigma deviations interesting enough to write about and experimenters will draw attention to them, but without saying 'detection' or 'discovery'.

More worrying is the throwaway remark about systematics accounting for 'about a sigma'. Is that a rule of thumb based on experience? Whose? What does it mean to combine the crystalline mathematical framework of Bayesian statistics with folk wisdom about systematics? (Not that I disparage folk wisdom, eg the 'rule' that half of 3 sigma results are bogus, but it doesn't fit with what the principles of the paper seem to be.)

Even if we do admit rules of thumb, this one seems suspect. Why should systematics (in completely different analyses!) produce effects about the same as the statistical error - isn't it likely that the effect might sometimes be significantly smaller, sometimes larger?

What is a 'systematic in the data analysis' anyway? I thought systematics were physical effects which lead to a more or less badly controlled shift in values and thus increase your uncertainty. Perhaps Gordon/Trotta mean a difference in the estimate of systematics between two analyses - i.e. an uncertainty in the size of the systematic uncertainty.

In any case, systematic errors are treated differently in particle physics: they are controlled and estimated and quoted in addition to the statistical error. Barring embarrassing mistakes, their size is well known. (Of course, that is much easier when you know how the whole apparatus was built.) If (say) CDF claimed 5 sigma discovery, then later revised it down to only 3.6 sigma due to a wrongly estimated systematic, I think it would be a major embarrassment.

So how about applying some meaningful statistical method - Bayesian? - to data selection and estimation of systematics? Hopefully it's not just a matter of taste.

Christopher Gordon
Posts: 14
Joined: September 27 2004
Affiliation: University of Canterbury
Contact:

### Re: [arXiv:0705.3980]

Thomas Dent wrote:Actually, in particle physics 5 sigma is required for 'discovery'. 3 sigma is commonly called 'evidence'. Anything above 2 sigma may be talked about in some terms ... but probably doesn't deserve to be.

From Gordon and Trotta:
The comparison between p-values and Bayes factors suggests a threshold of ℘ = 3 × 10⁻⁴ or σ = 3.6 is needed if the odds of 150:1 ("strong" support at best) are to be obtained. However, as systematic effects in the data analysis typically lead to a shift of order a sigma, it follows that the "particle physics threshold" of 5σ may be required in order to obtain odds of at best 150:1.
I'm not sure what to make of this. Particle physics doesn't have a 'threshold' of 5 sigma; theorists commonly find 3 sigma deviations interesting enough to write about and experimenters will draw attention to them, but without saying 'detection' or 'discovery'.
By 'particle physics threshold' we meant 'particle physics threshold of discovery'.
Thomas Dent wrote: More worrying is the throwaway remark about systematics accounting for 'about a sigma'. Is that a rule of thumb based on experience? Whose? What does it mean to combine the crystalline mathematical framework of Bayesian statistics with folk wisdom about systematics? (Not that I disparage folk wisdom, eg the 'rule' that half of 3 sigma results are bogus, but it doesn't fit with what the principles of the paper seem to be.)
That was just a rule of thumb based on recent cosmology examples like the scalar spectral index and correlations between CMB lensing and large scale structure. The point of the paper was not to give a definite number of sigma for 'detection' but to clarify the relationship between p-values (or sigma) and Bayes factors. The 1 sigma systematic was just to illustrate a scenario where a 5 sigma result may only correspond to odds of 150:1. But, of course one should try to accurately estimate the systematics.

Patrick McDonald
Posts: 16
Joined: November 06 2004
Affiliation: CITA
Contact:

### [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

There is a reasonably good practical reason why (unknown) systematic problems in analysis of data like this become increasingly likely at a level that tracks the statistical errors (i.e., at least ~1-sigma):
One good way to find problems is to do the analysis on subsets of the data broken up based on criteria that should *not* affect the results, and look for differences. But this only works if the problem is big enough to detect without using the whole data set, i.e., many sigma.
This is a reason, beyond any technical statistics reason, why the real believability of a measurement is always less than you'd say based on the number of sigma.
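The weakness of such split tests can be made quantitative in an idealized toy model (uncorrelated Gaussian errors, a systematic confined to one half of an even split; the numbers are my own illustration):

```python
def split_test_significance(syst_in_full_sigma):
    """A systematic of size s (in units of the full-dataset statistical
    error sigma_full) contaminates one half of the data.  Each half has
    sqrt(2)-larger errors, so the half-difference has error
    2 * sigma_full, and the null test flags the systematic at only
    s/2 sigma."""
    return syst_in_full_sigma / 2.0

# A systematic that biases the final answer by a full statistical sigma
# shows up in the half-split null test at a mere 0.5 sigma: invisible.
print(split_test_significance(1.0))   # 0.5
# To catch it at 3 sigma in the split, it must already be a 6-sigma bias
# on the full-dataset result.
print(split_test_significance(6.0))   # 3.0
```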

While I'm writing:
One thing I think about a lot when reading discussions of how one should interpret the significance of some measurement is that the discussions could benefit a lot from more concrete context. These discussions only really matter when there is some decision to be made, e.g., whether to work on one theory or another, or build one experiment or another, and these decisions generally bring some considerations and guidance on how to proceed.
For example: I have two potential detections of unrelated, equally interesting, new physics. I can build two equally difficult experiments to test them. I need to try to assess the relative significance of the two potential detections, and the decision is a bit subtle because presumably you want to build the experiment that tests the effect that is more likely to be real, up to the point where it is so significant that you are certain it is there and don't need to test it anymore.
As another example: if you're a phenomenological theorist and are thinking of working out further consequences of one theory or another, it seems reasonable to choose the one favored by the data after applying as a prior your best judgment as to which is more likely, or other things like how easy they are to work with.
I know there are plenty of discussions and papers that take this more concrete approach - my point is that any other kind of discussion is very likely to be a waste of time.

Bottom line:
The significance of a detection of an effect that isn't affecting our decision making in any way, just doesn't really matter
(which isn't to say that the authors don't need to compute it correctly, since presumably if the result is interesting it will generally affect some decisions).

Pat

Christopher Gordon
Posts: 14
Joined: September 27 2004
Affiliation: University of Canterbury
Contact:

### Re: [arXiv:0705.3980] Detection of Gravitational Lensing in the Cosmic Microwave Background

Patrick McDonald wrote:

One thing I think about a lot when reading discussions of how one should interpret the significance of some measurement is that the discussions could benefit a lot from more concrete context. These discussions only really matter when there is some decision to be made, e.g., whether to work on one theory or another, or build one experiment or another, and these decisions generally bring some considerations and guidance on how to proceed.
For example: I have two potential detections of unrelated, equally interesting, new physics. I can build two equally difficult experiments to test them. I need to try to assess the relative significance of the two potential detections, and the decision is a bit subtle because presumably you want to build the experiment that tests the effect that is more likely to be real, up to the point where it is so significant that you are certain it is there and don't need to test it anymore.
As another example: if you're a phenomenological theorist and are thinking of working out further consequences of one theory or another, it seems reasonable to choose the one favored by the data after applying as a prior your best judgment as to which is more likely, or other things like how easy they are to work with.
I know there are plenty of discussions and papers that take this more concrete approach - my point is that any other kind of discussion is very likely to be a waste of time.

Bottom line:
The significance of a detection of an effect that isn't affecting our decision making in any way, just doesn't really matter
(which isn't to say that the authors don't need to compute it correctly, since presumably if the result is interesting it will generally affect some decisions).

Pat
One way of incorporating the significance of a detection into the decision-making process is to use Bayesian decision theory. One of the quantities needed for Bayesian decision theory is the probability of the data given a model, P(D|M). Our paper discusses techniques for evaluating P(D|M) when one has only p-values, or when one is very uncertain about the correct priors to place on proposed new parameters.

Chris