Unexpected error with MPI Cobaya - "resume: true" option
Posted: May 18 2022
Hi,
I get an unexpected error (see below) when running Cobaya on SNe data, launched like this:
with the following content for sne.yaml:
Here is the error, which appears after about 1 hour of computation:
What I don't understand is that the option "resume: true" at the end of sne.yaml is expected to resume the computation if there is an error with MPI.
However, the whole computation stops, and I see no way to tell Cobaya to continue the execution.
Does anyone have a clue or a lead on how to fix this behavior?
Regards
Launch command:
$ nohup mpirun -np 4 cobaya-run sne.yaml &
Content of sne.yaml:
output: sne_chains/sne
theory:
  camb:
    extra_args:
      halofit_version: mead
      bbn_predictor: PArthENoPE_880.2_standard.dat
      lens_potential_accuracy: 1
      num_massive_neutrinos: 1
      nnu: 3.046
      theta_H0_range:
      - 20
      - 100
likelihood:
  sn.pantheon: null
params:
  logA:
    prior:
      min: 1.61
      max: 3.91
    ref:
      dist: norm
      loc: 3.05
      scale: 0.001
    proposal: 0.001
    latex: \log(10^{10} A_\mathrm{s})
    drop: true
  As:
    value: 'lambda logA: 1e-10*np.exp(logA)'
    latex: A_\mathrm{s}
  ns:
    prior:
      min: 0.8
      max: 1.2
    ref:
      dist: norm
      loc: 0.965
      scale: 0.004
    proposal: 0.002
    latex: n_\mathrm{s}
  theta_MC_100:
    prior:
      min: 0.5
      max: 10
    ref:
      dist: norm
      loc: 1.04109
      scale: 0.0004
    proposal: 0.0002
    latex: 100\theta_\mathrm{MC}
    drop: true
    renames: theta
  cosmomc_theta:
    value: 'lambda theta_MC_100: 1.e-2*theta_MC_100'
    derived: false
  H0:
    latex: H_0
    min: 20
    max: 100
  ombh2:
    prior:
      min: 0.005
      max: 0.1
    ref:
      dist: norm
      loc: 0.0224
      scale: 0.0001
    proposal: 0.0001
    latex: \Omega_\mathrm{b} h^2
  omch2:
    prior:
      min: 0.001
      max: 0.99
    ref:
      dist: norm
      loc: 0.12
      scale: 0.001
    proposal: 0.0005
    latex: \Omega_\mathrm{c} h^2
  omegam:
    latex: \Omega_\mathrm{m}
  omegamh2:
    derived: 'lambda omegam, H0: omegam*(H0/100)**2'
    latex: \Omega_\mathrm{m} h^2
  mnu: 0.06
  omk:
    prior:
      min: -0.3
      max: 0.3
    ref:
      dist: norm
      loc: 0.
      scale: 0.006
    proposal: 0.003
    latex: \Omega_\mathrm{k}
  omega_de:
    latex: \Omega_\Lambda
  YHe:
    latex: Y_\mathrm{P}
  Y_p:
    latex: Y_P^\mathrm{BBN}
  DHBBN:
    derived: 'lambda DH: 10**5*DH'
    latex: 10^5 \mathrm{D}/\mathrm{H}
  tau:
    prior:
      min: 0.01
      max: 0.8
    ref:
      dist: norm
      loc: 0.055
      scale: 0.006
    proposal: 0.003
    latex: \tau_\mathrm{reio}
  zrei:
    latex: z_\mathrm{re}
  sigma8:
    latex: \sigma_8
  s8h5:
    derived: 'lambda sigma8, H0: sigma8*(H0*1e-2)**(-0.5)'
    latex: \sigma_8/h^{0.5}
  s8omegamp5:
    derived: 'lambda sigma8, omegam: sigma8*omegam**0.5'
    latex: \sigma_8 \Omega_\mathrm{m}^{0.5}
  s8omegamp25:
    derived: 'lambda sigma8, omegam: sigma8*omegam**0.25'
    latex: \sigma_8 \Omega_\mathrm{m}^{0.25}
  A:
    derived: 'lambda As: 1e9*As'
    latex: 10^9 A_\mathrm{s}
  clamp:
    derived: 'lambda As, tau: 1e9*As*np.exp(-2*tau)'
    latex: 10^9 A_\mathrm{s} e^{-2\tau}
  age:
    latex: '{\rm{Age}}/\mathrm{Gyr}'
  rdrag:
    latex: r_\mathrm{drag}
sampler:
  mcmc:
    drag: true
    oversample_power: 0.4
    proposal_scale: 1.9
    covmat: auto
    Rminus1_stop: 0.01
    Rminus1_cl_stop: 0.2
resume: true
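A possible workaround, sketched here under assumptions rather than taken from the Cobaya documentation: as far as I understand, "resume: true" does not restart anything by itself after an MPI abort. It only lets a relaunched run continue from the chains already written under sne_chains/. If that is right, a small wrapper can relaunch the same command until it exits cleanly. The function name run_until_success and its parameters are hypothetical, and the sketch assumes that mpirun returns a non-zero exit code after the abort.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, wait_seconds=60.0):
    """Relaunch `cmd` until it exits with code 0 or the attempts run out.

    Returns the number of attempts used; raises RuntimeError if every
    attempt fails. Hypothetical helper, not part of Cobaya.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} exited with code {result.returncode}",
            file=sys.stderr,
        )
        if attempt < max_attempts:
            # Give the MPI processes time to shut down before relaunching.
            time.sleep(wait_seconds)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Example call (not executed here), reusing the launch command of this post
# and relying on `resume: true` in sne.yaml to continue the existing chains:
# run_until_success(["mpirun", "-np", "4", "cobaya-run", "sne.yaml"])
```

This only automates the relaunch. It does not address why one chain makes the others wait for too long.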
Error output:
[2 : mcmc] Learn + convergence test @ 8400 samples accepted.
[3 : mcmc] Learn + convergence test @ 8000 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:12:41 : 11818 steps taken, and 9588 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:12:42 : 11228 steps taken, and 8098 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:12:42 : 11439 steps taken, and 8505 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:12:43 : 10541 steps taken, and 5937 accepted.
[1 : mcmc] Learn + convergence test @ 9600 samples accepted.
[2 : mcmc] Learn + convergence test @ 8600 samples accepted.
[3 : mcmc] Learn + convergence test @ 8200 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:13:41 : 11983 steps taken, and 9730 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:13:42 : 11398 steps taken, and 8216 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:13:42 : 11608 steps taken, and 8657 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:13:43 : 10674 steps taken, and 5970 accepted.
[1 : mcmc] Learn + convergence test @ 9800 samples accepted.
[0 : mcmc] Learn + convergence test @ 6000 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:14:41 : 12206 steps taken, and 9864 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:14:42 : 11579 steps taken, and 8373 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:14:43 : 11762 steps taken, and 8775 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:14:43 : 10803 steps taken, and 6007 accepted.
[3 : mcmc] Learn + convergence test @ 8400 samples accepted.
[2 : mcmc] Learn + convergence test @ 8800 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:15:41 : 12388 steps taken, and 9995 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:15:42 : 11758 steps taken, and 8532 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:15:43 : 11917 steps taken, and 8891 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:15:43 : 10930 steps taken, and 6048 accepted.
[1 : mcmc] Learn + convergence test @ 10000 samples accepted.
[3 : mcmc] Learn + convergence test @ 8600 samples accepted.
[2 : mcmc] Learn + convergence test @ 9000 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:16:42 : 12570 steps taken, and 10139 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:16:42 : 11972 steps taken, and 8689 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:16:43 : 12080 steps taken, and 9027 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:16:43 : 11058 steps taken, and 6088 accepted.
[1 : mcmc] Learn + convergence test @ 10200 samples accepted.
[1 : mcmc] *ERROR* Waiting for too long for all chains to be ready. Maybe one of them is stuck or died unexpectedly?
[1 : mcmc] Aborting MPI due to error
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
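To see which chain the others may be waiting for, the "Progress" lines of the log can be summarized per chain. The following is a minimal sketch that assumes the log format shown above; the helper name last_progress is made up for illustration. In the excerpt, chain 0 stands out with far fewer accepted samples than chain 1 (6088 against 10139), though I am not sure how "steps" and "accepted" are counted when "drag: true" is set.

```python
import re

# Matches lines such as:
# [1 : mcmc] Progress @ 2022-05-18 03:16:42 : 12570 steps taken, and 10139 accepted.
PROGRESS = re.compile(
    r"\[(\d+) : mcmc\] Progress @ (\S+ \S+) : "
    r"(\d+) steps taken, and (\d+) accepted\."
)


def last_progress(log_text):
    """Map each chain rank to (timestamp, steps, accepted, accepted/steps),
    taken from the last Progress line seen for that rank."""
    latest = {}
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            rank = int(match.group(1))
            steps = int(match.group(3))
            accepted = int(match.group(4))
            latest[rank] = (match.group(2), steps, accepted, accepted / steps)
    return latest


# Last Progress line of each chain, copied from the log above.
sample = """\
[1 : mcmc] Progress @ 2022-05-18 03:16:42 : 12570 steps taken, and 10139 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:16:42 : 11972 steps taken, and 8689 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:16:43 : 12080 steps taken, and 9027 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:16:43 : 11058 steps taken, and 6088 accepted.
"""

for rank, (stamp, steps, accepted, rate) in sorted(last_progress(sample).items()):
    print(f"chain {rank}: {accepted}/{steps} accepted ({rate:.2f}) at {stamp}")
```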