Checkpoint segmentation fault

Pablo Lemos
Posts: 9
Joined: December 07 2015
Affiliation: Cambridge University

Checkpoint segmentation fault

Post by Pablo Lemos » November 03 2016

Hello,

I am trying to restart a previous CosmoMC run from the checkpoint files. The code was running fine the first time, but when I try to restart it I get the following error message:


2 Reading checkpoint from chains/hm_2.chk
3 Reading checkpoint from chains/hm_3.chk
4 Reading checkpoint from chains/hm_4.chk
1 Reading checkpoint from chains/hm_1.chk
starting Monte-Carlo
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
cosmomc 00000000007E1239 Unknown Unknown Unknown
cosmomc 00000000007DFB0E Unknown Unknown Unknown
cosmomc 0000000000771C52 Unknown Unknown Unknown
cosmomc 000000000070A1F8 Unknown Unknown Unknown
cosmomc 0000000000712C5B Unknown Unknown Unknown
libpthread.so.0 00007F5DF735C7E0 Unknown Unknown Unknown
cosmomc 000000000045EBD6 Unknown Unknown Unknown
cosmomc 000000000045ECE6 Unknown Unknown Unknown
cosmomc 0000000000554FDC Unknown Unknown Unknown
cosmomc 0000000000526AEE Unknown Unknown Unknown
cosmomc 000000000061791A Unknown Unknown Unknown
cosmomc 0000000000616CF7 Unknown Unknown Unknown
cosmomc 0000000000614249 Unknown Unknown Unknown
cosmomc 00000000004FEE8B Unknown Unknown Unknown
cosmomc 00000000005001CF Unknown Unknown Unknown
cosmomc 00000000005012BF Unknown Unknown Unknown
cosmomc 00000000004FF30A Unknown Unknown Unknown
cosmomc 0000000000519544 Unknown Unknown Unknown
cosmomc 000000000061F303 Unknown Unknown Unknown
cosmomc 0000000000439996 Unknown Unknown Unknown
libc.so.6 00007F5DF6FD7D1D Unknown Unknown Unknown
cosmomc 0000000000439889 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 30538 on
node calx024.ast.cam.ac.uk exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


This has happened with two different sets of checkpoint files. Any idea what could be causing it? It would be really useful if I could continue from those checkpoint files.

Best
Pablo

Antony Lewis
Posts: 1364
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint segmentation fault

Post by Antony Lewis » November 03 2016

Not really. Compiling cosmomc_debug instead may give a more helpful traceback.

Pablo Lemos
Posts: 9
Joined: December 07 2015
Affiliation: Cambridge University

Checkpoint segmentation fault

Post by Pablo Lemos » November 03 2016

Thanks for your help, Antony. When I run it in debug mode, this is what I get. Do you know what it means?

3 Reading checkpoint from chains/hm_3.chk
4 Reading checkpoint from chains/hm_4.chk
1 Reading checkpoint from chains/hm_1.chk
2 Reading checkpoint from chains/hm_2.chk
starting Monte-Carlo
forrtl: error (65): floating invalid
Image PC Routine Line Source
cosmomc_debug 0000000000E0BD39 Unknown Unknown Unknown
cosmomc_debug 0000000000E0A60E Unknown Unknown Unknown
cosmomc_debug 0000000000DA6F12 Unknown Unknown Unknown
cosmomc_debug 0000000000D3F128 Unknown Unknown Unknown
cosmomc_debug 0000000000D47D61 Unknown Unknown Unknown
libpthread.so.0 00007FC1BCBCD7E0 Unknown Unknown Unknown
cosmomc_debug 0000000000E0263B Unknown Unknown Unknown
cosmomc_debug 0000000000BA15A5 nonlinear_mp_tcb_ 747 halofit_ppf.f90
cosmomc_debug 0000000000B9FD59 nonlinear_mp_fill 680 halofit_ppf.f90
cosmomc_debug 0000000000BA1FCE nonlinear_mp_init 797 halofit_ppf.f90
cosmomc_debug 0000000000B99B33 nonlinear_mp_hmco 384 halofit_ppf.f90
cosmomc_debug 0000000000B95965 nonlinear_mp_nonl 98 halofit_ppf.f90
cosmomc_debug 0000000000C4AE24 cambmain_mp_maken 1157 cmbmain.f90
cosmomc_debug 0000000000C26E8F cambmain_mp_cmbma 224 cmbmain.f90
cosmomc_debug 0000000000CB4215 camb_mp_camb_getr 137 camb.f90
cosmomc_debug 0000000000CB3AA4 camb_mp_camb_gett 46 camb.f90
cosmomc_debug 0000000000767B97 calculator_camb_m 191 Calculator_CAMB.f90
cosmomc_debug 000000000075ED11 calclike_cosmolog 77 CalcLike_Cosmology.f90
cosmomc_debug 0000000000A30BA5 calclike_mp_theor 308 calclike.f90
cosmomc_debug 0000000000A2856B calclike_mp_getlo 146 calclike.f90
cosmomc_debug 00000000006CBF7F montecarlo_mp_tsa 94 MCMC.f90
cosmomc_debug 00000000006D161C montecarlo_mp_tme 279 MCMC.f90
cosmomc_debug 00000000006D42CF montecarlo_mp_tfa 353 MCMC.f90
cosmomc_debug 00000000006CD0F7 montecarlo_mp_tch 144 MCMC.f90
cosmomc_debug 0000000000729193 generalsetup_mp_t 137 GeneralSetup.f90
cosmomc_debug 0000000000A47E97 MAIN__ 268 driver.F90
cosmomc_debug 00000000004392A6 Unknown Unknown Unknown
libc.so.6 00007FC1BC848D1D Unknown Unknown Unknown
cosmomc_debug 0000000000439199 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 11697 on node calx024.ast.cam.ac.uk exited on signal 6 (Aborted).
--------------------------------------------------------------------------


Best
Pablo
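
For a debug traceback like the one in this post, the useful information is the innermost frame that still has a routine, line number and source file attached (here, a routine in halofit_ppf.f90 at line 747). The following is a minimal Python sketch for pulling those frames out of a saved log. It is not part of CosmoMC; the function names `resolved_frames` and `innermost_known_frame` are hypothetical, and it assumes the five-column `Image PC Routine Line Source` layout shown above, with routine names possibly truncated by the runtime.

```python
import re

# One traceback frame: image, hexadecimal PC, routine, line number, source file.
# Frames printed as "Unknown Unknown Unknown" do not match, because the
# line-number column must be an integer.
FRAME = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]+)\s+"
    r"(?P<routine>\S+)\s+(?P<line>\d+)\s+(?P<source>\S+)$"
)


def resolved_frames(traceback_text):
    """Return (routine, line, source) for each frame with source info, innermost first."""
    frames = []
    for raw in traceback_text.splitlines():
        match = FRAME.match(raw.strip())
        if match:
            frames.append((match["routine"], int(match["line"]), match["source"]))
    return frames


def innermost_known_frame(traceback_text):
    """Return the first resolved frame, or None if no frame has source info."""
    frames = resolved_frames(traceback_text)
    return frames[0] if frames else None


# A few frames copied from the debug traceback in this thread.
sample = """\
cosmomc_debug 0000000000E0263B Unknown Unknown Unknown
cosmomc_debug 0000000000BA15A5 nonlinear_mp_tcb_ 747 halofit_ppf.f90
cosmomc_debug 0000000000B9FD59 nonlinear_mp_fill 680 halofit_ppf.f90
cosmomc_debug 0000000000A47E97 MAIN__ 268 driver.F90
"""

print(innermost_known_frame(sample))
# → ('nonlinear_mp_tcb_', 747, 'halofit_ppf.f90')
```

The optimized build's traceback earlier in the thread would yield no resolved frames at all with this parser, which is why the debug build is the more informative one.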

Antony Lewis
Posts: 1364
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint segmentation fault

Post by Antony Lewis » November 08 2016

Not very helpful. The error in halofit could be a symptom of something more generally wrong in the power spectra.
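
One way to follow up on this is to check the power spectrum values for entries that would make a later floating-point operation invalid (for example NaN, infinite, or non-positive values, which break logarithms and ratios). The sketch below is plain Python for illustration, not CosmoMC or CAMB code: `find_invalid_power` and the sample numbers are hypothetical, and it assumes the spectrum has been dumped as paired lists of k and P(k).

```python
import math


def find_invalid_power(k_values, pk_values):
    """Return (index, k, P) for entries where P(k) is NaN, infinite, or not positive."""
    bad = []
    for i, (k, pk) in enumerate(zip(k_values, pk_values)):
        if not math.isfinite(pk) or pk <= 0.0:
            bad.append((i, k, pk))
    return bad


# Made-up values, chosen only to exercise the check.
k = [1e-3, 1e-2, 1e-1, 1.0]
pk = [2.0e4, 1.5e4, float("nan"), -3.0]

for i, kv, pv in find_invalid_power(k, pk):
    print(f"index {i}: k={kv}, P(k)={pv}")
```

If a check like this flags entries, the problem lies upstream of halofit (in the parameters or the linear spectrum), which would be consistent with the traceback only showing where the bad value was first used.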
