Checkpoint error

Zhaoyi Zhou
Posts: 10
Joined: September 22 2018
Affiliation: Shandong University

Checkpoint error

Post by Zhaoyi Zhou » October 09 2018

I installed the latest CosmoMC by following this, which helped me compile CosmoMC successfully.
I then set checkpoint to T (true) in test.ini because I wanted to check whether I could stop the program and then resume it. When I ran

Code: Select all

mpirun -np 2 ./cosmomc test.ini
it ran successfully. I then pressed Ctrl+C to interrupt the program. When I then ran

Code: Select all

mpirun -np 2 ./cosmomc test.ini
again, the program reported:

Code: Select all

  Number of MPI processes:           2
 file_root:test
 Random seeds:  7944, 16851 rand_inst:   1
 Random seeds:  8044, 16851 rand_inst:   2
 Using clik with likelihood file ./data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik
----
clik version 723c1a4b0580
  smica
----
clik version 723c1a4b0580
  smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
 Clik will run with the following nuisance parameters:
 A_cib_217
 cib_index
 xi_sz_cib
 A_sz
 ps_A_100_100
 ps_A_143_143
 ps_A_143_217
 ps_A_217_217
 ksz_norm
 gal545_A_100
 gal545_A_143
 gal545_A_143_217
 gal545_A_217
 calib_100T
 calib_217T
 A_planck
 Using clik with likelihood file ./data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
 Clik will run with the following nuisance parameters:
 A_planck
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
 Doing non-linear Pk: F
 Doing CMB lensing: T
 Doing non-linear lensing: T
 TT lmax =  2508
 EE lmax =  2500
 ET lmax =  2500
 BB lmax =  2500
 PP lmax =  2500
 lmax_computed_cl  =  2508
 Computing tensors: F
 max_eta_k         =    14000.0000    
 transfer kmax     =    5.00000000    
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
 adding parameters for: smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8
 adding parameters for: commander_rc2_v1.1_l2_29_B
 adding parameters for: BK14_dust
 adding parameters for: plik_dx11dr2_HM_v18_TT
 Fast divided into            2  blocks
 Block breaks at:           15
 28 parameters ( 7 slow ( 0 semi-slow), 21 fast ( 0 semi-fast))
           2 Reading checkpoint from chains/test_2.chk
           1 Reading checkpoint from chains/test_1.chk
 starting Monte-Carlo
 Chain           2  MPI communicating
 TRealArrayList: object of wrong type
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 0 on
node narip-pc exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

How can I solve this problem? I just want to resume from where I stopped. Should I just run

Code: Select all

mpirun -quiet -np 2 ./cosmomc test.ini
? Will this just resume?

Zhaoyi Zhou
Posts: 10
Joined: September 22 2018
Affiliation: Shandong University

Re: Checkpoint error

Post by Zhaoyi Zhou » October 09 2018

I ran

Code: Select all

mpirun -quiet -np 2 ./cosmomc test.ini
It just removed the message below the error, but did not get any further.

Code: Select all

Number of MPI processes:           2
 file_root:test
 Random seeds: 23480,  3500 rand_inst:   1
 Random seeds: 23580,  3500 rand_inst:   2
 Using clik with likelihood file ./data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik
----
clik version 723c1a4b0580
  smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
----
clik version 723c1a4b0580
  smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
 Clik will run with the following nuisance parameters:
 A_cib_217
 cib_index
 xi_sz_cib
 A_sz
 ps_A_100_100
 ps_A_143_143
 ps_A_143_217
 ps_A_217_217
 ksz_norm
 gal545_A_100
 gal545_A_143
 gal545_A_143_217
 gal545_A_217
 calib_100T
 calib_217T
 A_planck
 Using clik with likelihood file ./data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
 Clik will run with the following nuisance parameters:
 A_planck
 Doing non-linear Pk: F
 Doing CMB lensing: T
 Doing non-linear lensing: T
 TT lmax =  2508
 EE lmax =  2500
 ET lmax =  2500
 BB lmax =  2500
 PP lmax =  2500
 lmax_computed_cl  =  2508
 Computing tensors: F
 max_eta_k         =    14000.0000    
 transfer kmax     =    5.00000000    
 adding parameters for: smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8
 adding parameters for: commander_rc2_v1.1_l2_29_B
 adding parameters for: BK14_dust
 adding parameters for: plik_dx11dr2_HM_v18_TT
 Fast divided into            2  blocks
 Block breaks at:           15
 28 parameters ( 7 slow ( 0 semi-slow), 21 fast ( 0 semi-fast))
           2 Reading checkpoint from chains/test_2.chk
           1 Reading checkpoint from chains/test_1.chk
 starting Monte-Carlo
 Chain           2  MPI communicating
 TRealArrayList: object of wrong type
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG
I've seen some similar problems on CosmoCoffee. Should I change the file MCMC.f90? If so, how should I change it? I can't find where to make the change because I installed CosmoMC from git (previous answer).

Zhaoyi Zhou
Posts: 10
Joined: September 22 2018
Affiliation: Shandong University

Re: Checkpoint error

Post by Zhaoyi Zhou » October 09 2018

I see the same problem at this link: viewtopic.php?f=11&t=2827&p=7905&hilit=checkpoint#p7905. I also compiled CosmoMC with gfortran! But my version is 7.3.0, and I have no idea whether this really has something to do with gfortran. Does it mean I should recompile CosmoMC with ifort? (I used to compile that way, but I ran into some problems when compiling, so I switched to gfortran.)

Antony Lewis
Posts: 1394
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint error

Post by Antony Lewis » October 10 2018

Looks like this was a compiler incompatibility with some things in ObjectLists.f90; I re-wrote a bit of it to hopefully avoid the issue - try the latest commit on github master.

Zhaoyi Zhou
Posts: 10
Joined: September 22 2018
Affiliation: Shandong University

Re: Checkpoint error

Post by Zhaoyi Zhou » October 11 2018

I replaced the wl.f90 in my source directory with the changed one from GitHub, but it still went wrong. The error message:

Code: Select all

Number of MPI processes:           2
 file_root:test
 Random seeds:  8016, 11268 rand_inst:   2
 Random seeds:  7916, 11268 rand_inst:   1
----
clik version 723c1a4b0580
  smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
 Using clik with likelihood file ./data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
----
clik version 723c1a4b0580
  smica
           2 Reading checkpoint from chains/test_2.chk
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
 Clik will run with the following nuisance parameters:
 A_cib_217
 cib_index
 xi_sz_cib
 A_sz
 ps_A_100_100
 ps_A_143_143
 ps_A_143_217
 ps_A_217_217
 ksz_norm
 gal545_A_100
 gal545_A_143
 gal545_A_143_217
 gal545_A_217
 calib_100T
 calib_217T
 A_planck
 Using clik with likelihood file ./data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
 Clik will run with the following nuisance parameters:
 A_planck
 Doing non-linear Pk: F
 Doing CMB lensing: T
 Doing non-linear lensing: T
 TT lmax =  2508
 EE lmax =  2500
 ET lmax =  2500
 BB lmax =  2500
 PP lmax =  2500
 lmax_computed_cl  =  2508
 Computing tensors: F
 max_eta_k         =    14000.0000    
 transfer kmax     =    5.00000000    
 adding parameters for: smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8
 adding parameters for: commander_rc2_v1.1_l2_29_B
 adding parameters for: BK14_dust
 adding parameters for: plik_dx11dr2_HM_v18_TT
 Fast divided into            2  blocks
 Block breaks at:           15
 28 parameters ( 7 slow ( 0 semi-slow), 21 fast ( 0 semi-fast))
           1 Reading checkpoint from chains/test_1.chk
 starting Monte-Carlo
 Chain           2  MPI communicating
 TRealArrayList: object of wrong type
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 0 on
node narip-pc exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------
It seems that it only read one .chk file, whereas the previous run read two .chk files, since I ran the command with 2 processes. And the error is still "TRealArrayList: object of wrong type".
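
Both MPI processes write to the same terminal, so their lines can interleave, and a "Reading checkpoint" line may end up far from where it appeared in an earlier run. A minimal sketch for counting these lines wherever they land, assuming the run's output has been saved to a file; the temporary log and its four sample lines are stand-ins, not real output:

```shell
# Stand-in for a saved log; a real one could be captured with something like
#   mpirun -np 2 ./cosmomc test.ini > run.log 2>&1
log=$(mktemp)
printf '%s\n' \
  '           2 Reading checkpoint from chains/test_2.chk' \
  ' Clik will run with the following nuisance parameters:' \
  '           1 Reading checkpoint from chains/test_1.chk' \
  ' starting Monte-Carlo' > "$log"

# Count the checkpoint reads regardless of where they appear in the output.
n_read=$(grep -c 'Reading checkpoint from' "$log")
echo "checkpoint files read: $n_read"

# List which checkpoint files were read, one per line, in sorted order.
grep -o 'chains/[^ ]*\.chk' "$log" | sort
```

With two processes, a count of 2 would suggest that both checkpoint files were opened, even when the two lines are separated by other output.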

Antony Lewis
Posts: 1394
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint error

Post by Antony Lewis » October 11 2018

Did you delete all old (possibly corrupted) .chk files?

Zhaoyi Zhou
Posts: 10
Joined: September 22 2018
Affiliation: Shandong University

Re: Checkpoint error

Post by Zhaoyi Zhou » October 11 2018

I didn't delete anything in my chains directory. test_1.chk and test_2.chk are still there. I then ran it again, and although this time it read from both test_1.chk and test_2.chk, it still went wrong:

Code: Select all

Number of MPI processes:           2
 file_root:test
 Random seeds: 27627, 27709 rand_inst:   2
 Random seeds: 27527, 27709 rand_inst:   1
 Using clik with likelihood file ./data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik
----
clik version 723c1a4b0580
  smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
----
clik version 723c1a4b0580
  smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.6809e-09)
----
   TT from l=0 to l=        2508
 Clik will run with the following nuisance parameters:
 A_cib_217
 cib_index
 xi_sz_cib
 A_sz
 ps_A_100_100
 ps_A_143_143
 ps_A_143_217
 ps_A_217_217
 ksz_norm
 gal545_A_100
 gal545_A_143
 gal545_A_143_217
 gal545_A_217
 calib_100T
 calib_217T
 A_planck
 Using clik with likelihood file ./data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik
----
clik version 723c1a4b0580
  gibbs_gauss 1478fb2d-28fa-49ac-a8ae-677dbdc3600a
Checking likelihood './data/clik/low_l/commander/commander_rc2_v1.1_l2_29_B.clik' on test data. got -7.32304 expected -7.32304 (diff -2.52096e-10)
----
   TT from l=0 to l=          29
 Clik will run with the following nuisance parameters:
 A_planck
 Doing non-linear Pk: F
 Doing CMB lensing: T
 Doing non-linear lensing: T
 TT lmax =  2508
 EE lmax =  2500
 ET lmax =  2500
 BB lmax =  2500
 PP lmax =  2500
 lmax_computed_cl  =  2508
 Computing tensors: F
 max_eta_k         =    14000.0000    
 transfer kmax     =    5.00000000    
 adding parameters for: smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8
 adding parameters for: commander_rc2_v1.1_l2_29_B
 adding parameters for: BK14_dust
 adding parameters for: plik_dx11dr2_HM_v18_TT
 Fast divided into            2  blocks
 Block breaks at:           15
 28 parameters ( 7 slow ( 0 semi-slow), 21 fast ( 0 semi-fast))
           2 Reading checkpoint from chains/test_2.chk
           1 Reading checkpoint from chains/test_1.chk
 starting Monte-Carlo
 Chain           2  MPI communicating
 TRealArrayList: object of wrong type
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 0 on
node narip-pc exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------
If wl.f90 is really not the problem, can I copy the chains to another place, delete CosmoMC, recompile it with gfortran, and then put the chains back? (I saw that CosmoMC on GitHub has been updated again, in wl.f90 and camb/halofit_ppf.f90.)

Antony Lewis
Posts: 1394
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint error

Post by Antony Lewis » October 12 2018

I did fix an (unconnected) bug in wl.f90 as well. First I would try deleting all old .chk files.

Zhaoyi Zhou
Posts: 10
Joined: September 22 2018
Affiliation: Shandong University

Re: Checkpoint error

Post by Zhaoyi Zhou » October 14 2018

You mean delete the .chk files? Of course that will run, but doesn't that mean starting from the beginning? I want to resume. Sorry, I don't quite understand what you mean.

Antony Lewis
Posts: 1394
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint error

Post by Antony Lewis » October 15 2018

Yes, you'll need to rerun (but then checkpointing should work OK after the fix) - if the .chk files are corrupted by the bug you won't be able to restart from the old ones.
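
A minimal sketch of clearing the old checkpoints before the rerun, assuming the chains directory and the test file root seen in the logs above. The helper name stash_checkpoints and the chk_backup directory are hypothetical; moving the files aside instead of deleting them keeps them available for inspection:

```shell
# Move every checkpoint file for a given chain root out of the way.
# Usage: stash_checkpoints <chains_dir> <file_root>
stash_checkpoints() {
    chains_dir=$1
    root=$2
    backup_dir="$chains_dir/chk_backup"   # hypothetical backup location
    mkdir -p "$backup_dir"
    for f in "$chains_dir/$root"_*.chk; do
        [ -e "$f" ] || continue           # no matching files: nothing to do
        mv "$f" "$backup_dir/"
    done
}

# Example, run from the CosmoMC directory after rebuilding with the fix:
#   stash_checkpoints chains test
#   mpirun -np 2 ./cosmomc test.ini    # rerun from the beginning, as described above
```

Only files matching the root and the .chk extension are moved; other files in the chains directory are left where they are.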

Antony Lewis
Posts: 1394
Joined: September 23 2004
Affiliation: University of Sussex

Re: Checkpoint error

Post by Antony Lewis » October 16 2018

I just fixed another bug which might be affecting this.
