Cobaya IO error on more than one node at NERSC

Stephen Chen
Posts: 3
Joined: July 29 2021
Affiliation: UC Berkeley

Post by Stephen Chen » July 29 2021

I've been having a weird issue running Cobaya on NERSC. On the Haswell nodes (Cori), if I run something like

srun -N 1 -n 16 -c 4 cobaya-run chain.yaml

in an interactive session with one node requested, all is fine. But if I instead request two nodes with

srun -N 2 -n 16 -c 8 cobaya-run chain.yaml

everything runs for a while until it hits a convergence test, at which point it prints something like

[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat

but then never prints any acceptance rates or R-1 values. After this it seems to keep generating samples in the chains, but it stops updating the file written by the 0th process (chain.1.txt) altogether, and it no longer writes out updated proposals either.

Anyone know what I'm doing wrong?
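
For reference, here is a rough sketch of how the two srun layouts above map onto the nodes. It assumes tasks are spread evenly across nodes and that a Cori Haswell node exposes 64 logical CPUs (32 cores with 2 hyperthreads each); the helper name `srun_layout` is made up for illustration.

```python
def srun_layout(nodes, tasks, cpus_per_task):
    """Return (MPI tasks per node, logical CPUs used per node), assuming an even spread."""
    if tasks % nodes:
        raise ValueError("tasks do not divide evenly across nodes")
    tasks_per_node = tasks // nodes
    return tasks_per_node, tasks_per_node * cpus_per_task


# Assumption: Cori Haswell node = 32 physical cores x 2 hyperthreads.
LOGICAL_CPUS_PER_NODE = 64

# (nodes, tasks, cpus per task) taken from the -N, -n and -c flags above.
layouts = {"one node": (1, 16, 4), "two nodes": (2, 16, 8)}

for label, args in layouts.items():
    per_node, cpus = srun_layout(*args)
    print(f"{label}: {per_node} MPI tasks per node, "
          f"{cpus}/{LOGICAL_CPUS_PER_NODE} logical CPUs per node")
```

Under those assumptions both layouts fill each node exactly and both run 16 chains, so the visible difference is that the second one spans two nodes.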

Antony Lewis
Posts: 1936
Joined: September 23 2004
Affiliation: University of Sussex

Re: Cobaya IO error on more than one node at NERSC

Post by Antony Lewis » July 29 2021

If you are using Pandas 1.3+, make sure you have the latest Cobaya (3.1.1). Otherwise, what was the exact log?
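
A quick way to check which versions are installed is sketched below. It uses only the standard library; the helper names are hypothetical, and the upgrade rule simply encodes the advice in this reply (Pandas 1.3+ together with a Cobaya older than 3.1.1) rather than anything taken from Cobaya's own code.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '3.1.0' into (3, 1, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_cobaya_upgrade(pandas_version, cobaya_version):
    """Rule of thumb from this thread: Pandas 1.3+ wants Cobaya 3.1.1 or later."""
    return (parse_version(pandas_version) >= (1, 3, 0)
            and parse_version(cobaya_version) < (3, 1, 1))


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    pandas_v = installed_version("pandas")
    cobaya_v = installed_version("cobaya")
    print("pandas:", pandas_v, "| cobaya:", cobaya_v)
    if pandas_v and cobaya_v and needs_cobaya_upgrade(pandas_v, cobaya_v):
        print("Upgrade suggested: python -m pip install --upgrade cobaya")
```

Run it inside the same Python environment that the batch job uses, since a login shell and a compute node may pick up different installations.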

Re: Cobaya IO error on more than one node at NERSC

Post by Stephen Chen » July 30 2021

There wasn't anything unusual in the log file, except that the zeroth process stops printing after it says the chains are ready to check convergence:

[8 : theorycollection] Average computation time:
   camb.transfers : 1.8527 s (7 evaluations, 12.9689 s total)
   camb : 0.00321901 s (7 evaluations, 0.0225331 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.78604 s (7 evaluations, 33.5022 s total)
[8 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[5 : mcmc] Progress @ 2021-07-29 13:14:43 : 1920 steps taken, and 68 accepted.
[11 : mcmc] Progress @ 2021-07-29 13:14:43 : 1844 steps taken, and 68 accepted.
[10 : mcmc] Progress @ 2021-07-29 13:14:43 : 1889 steps taken, and 67 accepted.
[7 : mcmc] Progress @ 2021-07-29 13:14:43 : 1705 steps taken, and 62 accepted.
[14 : mcmc] Progress @ 2021-07-29 13:14:43 : 1687 steps taken, and 62 accepted.
[1 : mcmc] Progress @ 2021-07-29 13:14:44 : 1803 steps taken, and 65 accepted.
[6 : mcmc] Progress @ 2021-07-29 13:14:44 : 1812 steps taken, and 66 accepted.
[2 : mcmc] Progress @ 2021-07-29 13:14:44 : 1973 steps taken, and 72 accepted.
[8 : mcmc] Progress @ 2021-07-29 13:14:44 : 1776 steps taken, and 65 accepted.
[9 : mcmc] Progress @ 2021-07-29 13:14:45 : 1856 steps taken, and 67 accepted.
[12 : mcmc] Progress @ 2021-07-29 13:14:45 : 2087 steps taken, and 76 accepted.
[13 : mcmc] Progress @ 2021-07-29 13:14:47 : 2280 steps taken, and 83 accepted.
[15 : mcmc] Progress @ 2021-07-29 13:14:47 : 2227 steps taken, and 80 accepted.
[3 : mcmc] Progress @ 2021-07-29 13:14:48 : 1871 steps taken, and 69 accepted.
[4 : mcmc] Progress @ 2021-07-29 13:14:49 : 1786 steps taken, and 66 accepted.
[4 : mcmc] Learn + convergence test @ 124 samples accepted.
[4 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00164264 s (3388 evaluations, 5.56525 s total)
   boss_likelihoods.SGCZ3 : 0.00165494 s (3388 evaluations, 5.60694 s total)
   boss_likelihoods.NGCZ1 : 0.0016739 s (3388 evaluations, 5.67117 s total)
   boss_likelihoods.SGCZ1 : 0.00160126 s (3388 evaluations, 5.42505 s total)
[4 : theorycollection] Average computation time:
   camb.transfers : 1.78182 s (13 evaluations, 23.1636 s total)
   camb : 0.0034203 s (13 evaluations, 0.0444639 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.57128 s (13 evaluations, 59.4266 s total)
[2 : mcmc] Learn + convergence test @ 124 samples accepted.
[2 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00171652 s (3349 evaluations, 5.74862 s total)
   boss_likelihoods.SGCZ3 : 0.00170801 s (3349 evaluations, 5.72014 s total)
   boss_likelihoods.NGCZ1 : 0.00172517 s (3349 evaluations, 5.77759 s total)
   boss_likelihoods.SGCZ1 : 0.00171486 s (3349 evaluations, 5.74306 s total)
[2 : theorycollection] Average computation time:
   camb.transfers : 1.86133 s (13 evaluations, 24.1973 s total)
   camb : 0.00332118 s (13 evaluations, 0.0431754 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.59433 s (13 evaluations, 59.7263 s total)
[5 : mcmc] Learn + convergence test @ 124 samples accepted.
[5 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00166453 s (3422 evaluations, 5.69602 s total)
   boss_likelihoods.SGCZ3 : 0.00171437 s (3422 evaluations, 5.86658 s total)
   boss_likelihoods.NGCZ1 : 0.00173858 s (3422 evaluations, 5.94941 s total)
   boss_likelihoods.SGCZ1 : 0.00167216 s (3422 evaluations, 5.72215 s total)
[5 : theorycollection] Average computation time:
   camb.transfers : 1.80986 s (13 evaluations, 23.5282 s total)
   camb : 0.00346066 s (13 evaluations, 0.0449886 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.62556 s (13 evaluations, 60.1323 s total)
[3 : mcmc] Learn + convergence test @ 124 samples accepted.
[3 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00172031 s (3389 evaluations, 5.83012 s total)
   boss_likelihoods.SGCZ3 : 0.00173748 s (3389 evaluations, 5.88833 s total)
   boss_likelihoods.NGCZ1 : 0.00172963 s (3389 evaluations, 5.86171 s total)
   boss_likelihoods.SGCZ1 : 0.00167569 s (3389 evaluations, 5.67893 s total)
[3 : theorycollection] Average computation time:
   camb.transfers : 1.79469 s (13 evaluations, 23.331 s total)
   camb : 0.0035264 s (13 evaluations, 0.0458433 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.62014 s (13 evaluations, 60.0619 s total)
[11 : mcmc] Learn + convergence test @ 124 samples accepted.
[11 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00170583 s (3383 evaluations, 5.77081 s total)
   boss_likelihoods.SGCZ3 : 0.00174491 s (3383 evaluations, 5.90304 s total)
   boss_likelihoods.NGCZ1 : 0.00175022 s (3383 evaluations, 5.92101 s total)
   boss_likelihoods.SGCZ1 : 0.00171552 s (3383 evaluations, 5.80359 s total)
[11 : theorycollection] Average computation time:
   camb.transfers : 1.85116 s (13 evaluations, 24.0651 s total)
   camb : 0.0034926 s (13 evaluations, 0.0454037 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.5997 s (13 evaluations, 59.7961 s total)
[10 : mcmc] Learn + convergence test @ 124 samples accepted.
[10 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00174185 s (3432 evaluations, 5.97803 s total)
   boss_likelihoods.SGCZ3 : 0.00177613 s (3432 evaluations, 6.09567 s total)
   boss_likelihoods.NGCZ1 : 0.00179825 s (3432 evaluations, 6.17161 s total)
   boss_likelihoods.SGCZ1 : 0.00172911 s (3432 evaluations, 5.93432 s total)
[10 : theorycollection] Average computation time:
   camb.transfers : 1.85806 s (13 evaluations, 24.1548 s total)
   camb : 0.00360494 s (13 evaluations, 0.0468642 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.58595 s (13 evaluations, 59.6173 s total)
[12 : mcmc] Learn + convergence test @ 124 samples accepted.
[12 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00178004 s (3353 evaluations, 5.96847 s total)
   boss_likelihoods.SGCZ3 : 0.00186336 s (3353 evaluations, 6.24783 s total)
   boss_likelihoods.NGCZ1 : 0.00188818 s (3353 evaluations, 6.33108 s total)
   boss_likelihoods.SGCZ1 : 0.001822 s (3353 evaluations, 6.10916 s total)
[12 : theorycollection] Average computation time:
   camb.transfers : 1.82906 s (13 evaluations, 23.7778 s total)
   camb : 0.00331918 s (13 evaluations, 0.0431494 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.58965 s (13 evaluations, 59.6654 s total)
[6 : mcmc] Learn + convergence test @ 124 samples accepted.
[6 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00168769 s (3414 evaluations, 5.76176 s total)
   boss_likelihoods.SGCZ3 : 0.00169835 s (3414 evaluations, 5.79816 s total)
   boss_likelihoods.NGCZ1 : 0.00166912 s (3414 evaluations, 5.69836 s total)
   boss_likelihoods.SGCZ1 : 0.00165474 s (3414 evaluations, 5.64927 s total)
[6 : theorycollection] Average computation time:
   camb.transfers : 1.80362 s (13 evaluations, 23.4471 s total)
   camb : 0.00346121 s (13 evaluations, 0.0449958 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.74782 s (13 evaluations, 61.7217 s total)
[13 : mcmc] Learn + convergence test @ 124 samples accepted.
[13 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00182908 s (3399 evaluations, 6.21704 s total)
   boss_likelihoods.SGCZ3 : 0.00191204 s (3399 evaluations, 6.49903 s total)
   boss_likelihoods.NGCZ1 : 0.00195682 s (3399 evaluations, 6.65122 s total)
   boss_likelihoods.SGCZ1 : 0.00188186 s (3399 evaluations, 6.39644 s total)
[13 : theorycollection] Average computation time:
   camb.transfers : 1.81345 s (13 evaluations, 23.5748 s total)
   camb : 0.00323844 s (13 evaluations, 0.0420997 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.57634 s (13 evaluations, 59.4925 s total)
[1 : mcmc] Learn + convergence test @ 124 samples accepted.
[1 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00163155 s (3408 evaluations, 5.56031 s total)
   boss_likelihoods.SGCZ3 : 0.00162348 s (3408 evaluations, 5.53282 s total)
   boss_likelihoods.NGCZ1 : 0.00162209 s (3408 evaluations, 5.52808 s total)
   boss_likelihoods.SGCZ1 : 0.00158082 s (3408 evaluations, 5.38745 s total)
[1 : theorycollection] Average computation time:
   camb.transfers : 1.81128 s (13 evaluations, 23.5467 s total)
   camb : 0.00335596 s (13 evaluations, 0.0436275 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.91742 s (13 evaluations, 63.9265 s total)
[14 : mcmc] Learn + convergence test @ 124 samples accepted.
[14 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00177125 s (3349 evaluations, 5.9319 s total)
   boss_likelihoods.SGCZ3 : 0.00190338 s (3349 evaluations, 6.37441 s total)
   boss_likelihoods.NGCZ1 : 0.00194203 s (3349 evaluations, 6.50386 s total)
   boss_likelihoods.SGCZ1 : 0.00195459 s (3349 evaluations, 6.54593 s total)
[14 : theorycollection] Average computation time:
   camb.transfers : 1.7542 s (13 evaluations, 22.8046 s total)
   camb : 0.00326567 s (13 evaluations, 0.0424537 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.7653 s (13 evaluations, 61.9489 s total)
[8 : mcmc] Learn + convergence test @ 124 samples accepted.
[8 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00182512 s (3357 evaluations, 6.12694 s total)
   boss_likelihoods.SGCZ3 : 0.00187486 s (3357 evaluations, 6.2939 s total)
   boss_likelihoods.NGCZ1 : 0.00189183 s (3357 evaluations, 6.35086 s total)
   boss_likelihoods.SGCZ1 : 0.00170692 s (3357 evaluations, 5.73014 s total)
[8 : theorycollection] Average computation time:
   camb.transfers : 1.85158 s (13 evaluations, 24.0705 s total)
   camb : 0.00324191 s (13 evaluations, 0.0421449 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.7709 s (13 evaluations, 62.0217 s total)
[15 : mcmc] Learn + convergence test @ 124 samples accepted.
[15 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00178258 s (3434 evaluations, 6.12139 s total)
   boss_likelihoods.SGCZ3 : 0.00184182 s (3434 evaluations, 6.32481 s total)
   boss_likelihoods.NGCZ1 : 0.00183351 s (3434 evaluations, 6.29628 s total)
   boss_likelihoods.SGCZ1 : 0.00175728 s (3434 evaluations, 6.03451 s total)
[15 : theorycollection] Average computation time:
   camb.transfers : 1.79387 s (13 evaluations, 23.3203 s total)
   camb : 0.00334346 s (13 evaluations, 0.0434649 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.83462 s (13 evaluations, 62.8501 s total)
[9 : mcmc] Learn + convergence test @ 124 samples accepted.
[9 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00176841 s (3434 evaluations, 6.07272 s total)
   boss_likelihoods.SGCZ3 : 0.0018425 s (3434 evaluations, 6.32713 s total)
   boss_likelihoods.NGCZ1 : 0.00189659 s (3434 evaluations, 6.51288 s total)
   boss_likelihoods.SGCZ1 : 0.00187378 s (3434 evaluations, 6.43456 s total)
[9 : theorycollection] Average computation time:
   camb.transfers : 1.7987 s (13 evaluations, 23.383 s total)
   camb : 0.00336386 s (13 evaluations, 0.0437302 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.80307 s (13 evaluations, 62.4399 s total)
[7 : mcmc] Learn + convergence test @ 124 samples accepted.
[7 : likelihoodcollection] Average computation time:
   boss_likelihoods.NGCZ3 : 0.00161761 s (3435 evaluations, 5.55648 s total)
   boss_likelihoods.SGCZ3 : 0.00161833 s (3435 evaluations, 5.55896 s total)
   boss_likelihoods.NGCZ1 : 0.00164705 s (3435 evaluations, 5.65763 s total)
   boss_likelihoods.SGCZ1 : 0.0016656 s (3435 evaluations, 5.72134 s total)
[7 : theorycollection] Average computation time:
   camb.transfers : 1.85829 s (14 evaluations, 26.0161 s total)
   camb : 0.00343564 s (14 evaluations, 0.0480989 s total)
   fs_likelihood_zs.PT_pk_theory_zs : 4.89066 s (14 evaluations, 68.4692 s total)
[5 : mcmc] Progress @ 2021-07-29 13:15:43 : 3840 steps taken, and 139 accepted.
[2 : mcmc] Progress @ 2021-07-29 13:15:44 : 4030 steps taken, and 148 accepted.
[7 : mcmc] Progress @ 2021-07-29 13:15:44 : 3576 steps taken, and 129 accepted.
and the actual chain file with all the steps also stops updating.

The actual command I ran was

srun -N 2 -n 16 -c 8 cobaya-run test_two_cores.yaml
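
With 16 interleaved ranks it is easy to miss which process went quiet, so a small parser can help. The sketch below assumes log lines of the form `[rank : component] message`, as in the log above; the function names are invented for illustration, and the sample is a few lines copied from that log.

```python
import re

# Matches lines such as "[4 : mcmc] Learn + convergence test @ 124 samples accepted."
LINE = re.compile(r"^\[(\d+) : (\w+)\] (.*)$")


def ranks_reporting(log_text, needle):
    """Collect the MPI ranks whose log messages contain `needle`."""
    ranks = set()
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and needle in match.group(3):
            ranks.add(int(match.group(1)))
    return ranks


def silent_ranks(log_text, needle, n_ranks):
    """Return the ranks in 0..n_ranks-1 that never printed `needle`."""
    return sorted(set(range(n_ranks)) - ranks_reporting(log_text, needle))


sample = """\
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[2 : mcmc] Learn + convergence test @ 124 samples accepted.
[3 : mcmc] Learn + convergence test @ 124 samples accepted.
[1 : mcmc] Learn + convergence test @ 124 samples accepted.
"""

print(silent_ranks(sample, "Learn + convergence test", n_ranks=4))
```

Reading the full log above by eye, ranks 1 to 15 all print the "Learn + convergence test" line and rank 0 does not, which is consistent with only the zeroth process stalling.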

Re: Cobaya IO error on more than one node at NERSC

Post by Antony Lewis » July 30 2021

Not sure. Is it version 3.1.1? (Normally you don't need to run so many chains; I usually run 4-8.)

Re: Cobaya IO error on more than one node at NERSC

Post by Stephen Chen » July 31 2021

This is 3.1.0... should I upgrade?

On an unrelated note, I've been having issues where my chains fail around R-1 ≳ 1 (so after they've started learning the proposal) because they are unable to find new points after many tries. Any ideas?

Re: Cobaya IO error on more than one node at NERSC

Post by Antony Lewis » July 31 2021

Yes, upgrade.
