CosmoMC: seg fault for action=2

Graeme Addison
Posts: 27
Joined: July 17 2014
Affiliation: Johns Hopkins University

Post by Graeme Addison » August 09 2019

I'm hitting a seg fault when running action=2 with the new July 2019 CosmoMC (and the new Planck likelihood, although it doesn't seem to matter which likelihoods are used). action=0 and action=4 work fine as far as I can tell, so maybe it is specific to some routine used in the minimization? Could I be missing some compiler option that only matters for action=2?

The last few lines of the job output are pasted below:

At the return from BOBYQA Number of function values = 156
Least value of F = 3.166520941837106D+02 The corresponding X is:
1.876704D+00 -2.938268D+00 -1.342395D+00 -6.037330D+01 -3.960370D+01
-1.874395D+00 -3.010149D+00
4 Refining minimimum using low temp MCMC
4 Current logLike: 316.652094183711
4 Minimize MCMC with temp 1.000000000000000E-002
3 Stopping as have 140 samples.
slow changes 236 power changes 4
3 MCMC MaxLike = 316.627174630316
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cosmomc 000000000069DBBD for__signal_handl Unknown Unknown
libpthread-2.17.s 00002BA777B3D6D0 Unknown Unknown Unknown
cosmomc 00000000006884EF Unknown Unknown Unknown
cosmomc 000000000068938F Unknown Unknown Unknown
cosmomc 000000000068A5AC for_alloc_assign_ Unknown Unknown
cosmomc 00000000005AD1EC Unknown Unknown Unknown
cosmomc 00000000004A1E34 Unknown Unknown Unknown
cosmomc 000000000049E55D Unknown Unknown Unknown
cosmomc 0000000000544CC8 Unknown Unknown Unknown
cosmomc 000000000048D3DF Unknown Unknown Unknown
cosmomc 000000000054A91A Unknown Unknown Unknown
cosmomc 00000000005487D9 Unknown Unknown Unknown
cosmomc 000000000046F8E6 Unknown Unknown Unknown
cosmomc 0000000000553882 Unknown Unknown Unknown
cosmomc 000000000040EEDE Unknown Unknown Unknown
libc-2.17.so 00002BA777D6C445 __libc_start_main Unknown Unknown
cosmomc 000000000040EDE9 Unknown Unknown Unknown

Antony Lewis
Posts: 1522
Joined: September 23 2004
Affiliation: University of Sussex

Post by Antony Lewis » August 09 2019

Does cosmomc_debug give something more helpful?

Post by Graeme Addison » August 09 2019

Output from repeating run with cosmomc_debug:

forrtl: error (73): floating divide by zero
Image PC Routine Line Source
cosmomc_debug 0000000000CAAA8E Unknown Unknown Unknown
libpthread-2.17.s 00002AD02542F6D0 Unknown Unknown Unknown
cosmomc_debug 00000000004CE9C4 powell_constraine 1000 PowellConstrainedMinimize.f90
cosmomc_debug 00000000004C029F powell_constraine 462 PowellConstrainedMinimize.f90
cosmomc_debug 00000000004B69F8 powell_constraine 166 PowellConstrainedMinimize.f90
cosmomc_debug 000000000052C558 minimize_mp_tpowe 253 minimize.f90
cosmomc_debug 0000000000533086 minimize_mp_tpowe 309 minimize.f90
cosmomc_debug 000000000081CDFE MAIN__ 226 driver.F90
cosmomc_debug 00000000004102DE Unknown Unknown Unknown
libc-2.17.so 00002AD025862445 __libc_start_main Unknown Unknown
cosmomc_debug 00000000004101E9 Unknown Unknown Unknown

Post by Antony Lewis » August 10 2019

Hmm, this seems to be in the minimizer code. That code should not have changed significantly; are you sure the exact example worked with the previous version? If so, please look at the code diff against the version that works for you and let me know if you see anything that changed around the div-by-zero line that could be causing it.

Pavel Motloch
Posts: 13
Joined: October 14 2016
Affiliation: CITA

Post by Pavel Motloch » August 28 2019

I got one of these when running a chain. However, this is not a vanilla CosmoMC, so it might be related to my changes. I only run into the issue when the neutrino mass is allowed to vary.

Error message from one of the walkers:
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
cosmomc_debug 0000000000CA1F2E Unknown Unknown Unknown
libpthread-2.12.s 00007F75FF0D07E0 Unknown Unknown Unknown
cosmomc_debug 0000000000952979 dtauda_ 39 equations.f90
cosmomc_debug 0000000000AF4F77 mathutils_mp_inte 45 MathUtils.f90
cosmomc_debug 000000000086274D results_mp_cambda 580 results.f90
cosmomc_debug 00000000008628F2 results_mp_cambda 590 results.f90
cosmomc_debug 000000000085C62C results_mp_cambda 475 results.f90
cosmomc_debug 00000000005D1F71 calculator_camb_m 671 Calculator_CAMB.f90
cosmomc_debug 00000000005D4571 calculator_camb_m 716 Calculator_CAMB.f90
cosmomc_debug 00000000007EBD6A cosmologyparamete 156 CosmologyParameterizations.f90
cosmomc_debug 0000000000805553 calclike_mp_theor 280 calclike.f90
cosmomc_debug 000000000059B38E calclike_cosmolog 44 CalcLike_Cosmology.f90
cosmomc_debug 0000000000805D09 calclike_mp_theor 299 calclike.f90
cosmomc_debug 00000000007FF368 calclike_mp_getlo 146 calclike.f90
cosmomc_debug 0000000000507F34 montecarlo_mp_tsa 94 MCMC.f90
cosmomc_debug 00000000005108A0 montecarlo_mp_tfa 369 MCMC.f90
cosmomc_debug 0000000000509004 montecarlo_mp_tch 144 MCMC.f90
cosmomc_debug 0000000000563638 generalsetup_mp_t 137 GeneralSetup.f90
cosmomc_debug 0000000000822918 MAIN__ 298 driver.F90
cosmomc_debug 0000000000411B1E Unknown Unknown Unknown
libc-2.12.so 00007F75FEAC7D20 __libc_start_main Unknown Unknown
cosmomc_debug 0000000000411A29 Unknown Unknown Unknown

Post by Antony Lewis » August 28 2019

This looks odd: grhoa2 (8*pi*G*rho*a**4) on that line should be positive definite, so something must be going very wrong with the neutrinos. Can you trace the relevant quantities? (It is still hard to see how you'd get *exactly* zero even if the neutrino routine returned a random float.)
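
One way to do such a trace is to check every contribution to grhoa2 before the division. This is only a sketch, written in Python rather than CAMB's Fortran: the component names, the `checked_dtauda` helper, and the relation dtauda = sqrt(3/grhoa2) are assumptions for illustration and are not copied from equations.f90.

```python
import math


def checked_dtauda(components):
    """Return sqrt(3 / grhoa2) after validating the inputs.

    components: dict mapping a (hypothetical) component name to its
    contribution to grhoa2. Individual terms may be negative (curvature,
    for example), so only non-finite terms and a non-positive total are
    treated as errors.
    """
    non_finite = [name for name, value in components.items()
                  if not math.isfinite(value)]
    if non_finite:
        raise FloatingPointError(f"non-finite contributions: {non_finite}")

    grhoa2 = math.fsum(components.values())
    if grhoa2 <= 0.0:
        raise FloatingPointError(
            f"grhoa2 = {grhoa2!r} is not positive; components: {components}")
    return math.sqrt(3.0 / grhoa2)
```

Raising at the point where the bad value first appears, with the per-component breakdown in the message, avoids having to guess which term went wrong from a divide-by-zero further down the line.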

There is of course also the possibility it is a compiler bug (which version?); easier to check if you have a reproducible case.

Post by Pavel Motloch » August 30 2019

Sorry, I don't have much time to deal with this right now and it is not a high priority for me. I will get to it eventually and let you know if I find a resolution.

Another thing it sometimes spits out is "longjmp causes uninitialized stack frame", which suggests a possible compiler issue. My environment is:
 1) vim/7.4                          9) texlive/2012    17) gdal/1.11
 2) subversion/1.8                  10) hdf5/1.8        18) python/2.7-2015q2
 3) emacs/24                        11) netcdf/4.2      19) firefox/esr
 4) git/2.7                         12) graphviz/2.28   20) fftw3/3.3
 5) env/rcc                         13) qt/4.8          21) cfitsio/3
 6) intel/18.0                      14) geos/3.4        22) slurm/current
 7) intelmpi/2018.2.199+intel-18.0  15) postgresql/9.2
 8) mkl/11.2                        16) proj/4.9

Post by Pavel Motloch » September 14 2019

Re Antony's comment: I read a bit about this, and the error might not be directly related to what is going on on that line. Citing from https://stackoverflow.com/questions/105 ... ringstream :

"Another possibility is a delayed floating-point trap. The floating-point co-processor (built into the CPU these days) generates a floating-point interrupt because of some illegal operation, but the interrupt doesn't get noticed until the next floating-point operation is attempted. So this crash might be a result of the previous floating-point operation, which could be anywhere."

Re the longjmp issue: DDT is pointing towards line 137 of "../code/plc_3.0/plc-3.01/src/plik/component_plugin/rel2015/corrnoise.c", in the function "cnoise_compute". A quick diff says the file has changed since 2015, but if possible I would like to avoid digging into how.

Will keep you informed.

Post by Pavel Motloch » September 19 2019

Antony, you were right.

After O(4) days of debugging the stochastic SEGFAULT, it looks like there are configurations where the neutrino masses are set to negative values.

The culprit is the "perturbative correction for the tiny error due to the neutrino velocity" on line 471 of results.f90 in CAMB.

One set of values for which it crashes is:
this%nu_masses(1) = 2.17586467721533
delta = 0.562097718928749
rhonu1 = 1.21135105113532
rhonu = 1.25254716115566

I will keep working on understanding exactly when it occurs. If you already know what is going on, please let me know. Also, if you have any references for the velocity correction, they might come in handy.

Thanks!

P.

Post by Pavel Motloch » September 19 2019

In reference https://arxiv.org/pdf/0911.2714.pdf, equation A3, should not the argument of the exponential be sqrt(q^2 + \tilde m^2) instead of q?

Post by Antony Lewis » September 19 2019

The q looks right: we are assuming the neutrinos are highly relativistic when they decouple, so the Fermi-Dirac distribution is frozen in (see Eq. 109 of https://cosmologist.info/teaching/EU/notes_eu1.pdf).
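
As a rough numerical illustration of the frozen-in distribution (a sketch, not CAMB's implementation): the occupation number 1/(e^q + 1) depends only on the comoving momentum q, while the mass enters only through the energy sqrt(q^2 + (a*m)^2). The function name, the integration cutoff, and the Simpson grid below are choices made for this example.

```python
import math

# Integral of q^3 / (e^q + 1) from 0 to infinity, i.e. the massless case.
FD_NORM = 7.0 * math.pi**4 / 120.0


def nu_rho_ratio(am, q_max=40.0, n=4000):
    """Energy density of one massive species relative to the massless case.

    am is the dimensionless combination a*m/(k_B*T_nu). The distribution
    function uses q alone in the exponential; the mass appears only in the
    energy factor sqrt(q^2 + am^2). Composite Simpson rule, n must be even.
    """
    h = q_max / n

    def integrand(q):
        return q * q * math.sqrt(q * q + am * am) / (math.exp(q) + 1.0)

    total = integrand(0.0) + integrand(q_max)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0 / FD_NORM
```

In the massless limit the ratio tends to 1, and for am much larger than 1 it grows linearly with am, which is the non-relativistic behaviour expected for a fixed number density.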

Do you have the input parameters that cause the problem? The neutrino mass looks very low, so maybe there is an issue there.

Post by Antony Lewis » September 19 2019

OK, it looks like there is an issue when 0 < m_nu < 0.0005 eV. It should probably just treat these light cases as exactly relativistic.

Post by Pavel Motloch » September 19 2019

Antony Lewis wrote (September 19 2019):
"The q looks right: we are assuming the neutrinos are highly relativistic when they decouple and the Fermi-Dirac is frozen in? (i.e. Eq 109 of https://cosmologist.info/teaching/EU/notes_eu1.pdf)

Do you have the input parameters that cause the problem? The neutrino mass looks very low, so maybe there is an issue there."

Right, should be Fermi-Dirac. I did not think it all the way through, thanks!


Post by Pavel Motloch » September 20 2019

Thanks.

Ahead of your fix, I just added a simple workaround of not applying the correction for m_nu < 1 meV. It seems to run without problems so far, so I will stick with that for now.
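
A minimal sketch of that kind of guard, in Python for illustration only. The helper name and the `correction_factor` callable are hypothetical: the callable stands in for the perturbative velocity correction, whose actual form in results.f90 is not reproduced here, and treating it as a multiplicative factor is an assumption.

```python
LIGHT_NU_THRESHOLD_EV = 1.0e-3  # 1 meV, the cutoff quoted in the post above


def corrected_nu_mass(m_nu_ev, correction_factor):
    """Apply a mass correction, leaving very light neutrinos untouched.

    m_nu_ev: neutrino mass in eV.
    correction_factor: hypothetical callable, mass -> multiplicative factor.
    """
    if m_nu_ev < 0.0:
        raise ValueError(f"negative input mass: {m_nu_ev!r}")
    if m_nu_ev < LIGHT_NU_THRESHOLD_EV:
        # Below the cutoff, skip the correction entirely.
        return m_nu_ev

    corrected = m_nu_ev * correction_factor(m_nu_ev)
    # "not corrected > 0" also rejects NaN, which "<= 0" would let through.
    if not corrected > 0.0:
        raise ValueError(f"correction gave a non-positive mass: {corrected!r}")
    return corrected
```

The second check is not part of the workaround described above; it is added so that a bad correction fails loudly where it happens instead of showing up later as a divide by zero.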
