Unexpected error with MPI Cobaya - "resume: true" option

Dournac Fabien
Posts: 74
Joined: May 18 2019
Affiliation: IRAP

Post by Dournac Fabien » May 18 2022

Hi,

I get an unexpected error (see below) when running Cobaya on the SNe likelihood, launched like this:

$ nohup mpirun -np 4 cobaya-run sne.yaml &
with the following content for sne.yaml:

output: sne_chains/sne
theory:
  camb:
    extra_args:
      halofit_version: mead
      bbn_predictor: PArthENoPE_880.2_standard.dat
      lens_potential_accuracy: 1
      num_massive_neutrinos: 1
      nnu: 3.046
      theta_H0_range:
      - 20
      - 100
likelihood:
  sn.pantheon: null
params:
  logA:
    prior:
      min: 1.61
      max: 3.91
    ref:
      dist: norm
      loc: 3.05
      scale: 0.001
    proposal: 0.001
    latex: \log(10^{10} A_\mathrm{s})
    drop: true
  As:
    value: 'lambda logA: 1e-10*np.exp(logA)'
    latex: A_\mathrm{s}
  ns:
    prior:
      min: 0.8
      max: 1.2
    ref:
      dist: norm
      loc: 0.965
      scale: 0.004
    proposal: 0.002
    latex: n_\mathrm{s}
  theta_MC_100:
    prior:
      min: 0.5
      max: 10
    ref:
      dist: norm
      loc: 1.04109
      scale: 0.0004
    proposal: 0.0002
    latex: 100\theta_\mathrm{MC}
    drop: true
    renames: theta
  cosmomc_theta:
    value: 'lambda theta_MC_100: 1.e-2*theta_MC_100'
    derived: false
  H0:
    latex: H_0
    min: 20
    max: 100
  ombh2:
    prior:
      min: 0.005
      max: 0.1
    ref:
      dist: norm
      loc: 0.0224
      scale: 0.0001
    proposal: 0.0001
    latex: \Omega_\mathrm{b} h^2
  omch2:
    prior:
      min: 0.001
      max: 0.99
    ref:
      dist: norm
      loc: 0.12
      scale: 0.001
    proposal: 0.0005
    latex: \Omega_\mathrm{c} h^2
  omegam:
    latex: \Omega_\mathrm{m}
  omegamh2:
    derived: 'lambda omegam, H0: omegam*(H0/100)**2'
    latex: \Omega_\mathrm{m} h^2
  mnu: 0.06
  omk:
    prior:
      min: -0.3
      max: 0.3
    ref:
      dist: norm
      loc: 0.
      scale: 0.006
    proposal: 0.003
    latex: \Omega_\mathrm{k}
  omega_de:
    latex: \Omega_\Lambda
  YHe:
    latex: Y_\mathrm{P}
  Y_p:
    latex: Y_P^\mathrm{BBN}
  DHBBN:
    derived: 'lambda DH: 10**5*DH'
    latex: 10^5 \mathrm{D}/\mathrm{H}
  tau:
    prior:
      min: 0.01
      max: 0.8
    ref:
      dist: norm
      loc: 0.055
      scale: 0.006
    proposal: 0.003
    latex: \tau_\mathrm{reio}
  zrei:
    latex: z_\mathrm{re}
  sigma8:
    latex: \sigma_8
  s8h5:
    derived: 'lambda sigma8, H0: sigma8*(H0*1e-2)**(-0.5)'
    latex: \sigma_8/h^{0.5}
  s8omegamp5:
    derived: 'lambda sigma8, omegam: sigma8*omegam**0.5'
    latex: \sigma_8 \Omega_\mathrm{m}^{0.5}
  s8omegamp25:
    derived: 'lambda sigma8, omegam: sigma8*omegam**0.25'
    latex: \sigma_8 \Omega_\mathrm{m}^{0.25}
  A:
    derived: 'lambda As: 1e9*As'
    latex: 10^9 A_\mathrm{s}
  clamp:
    derived: 'lambda As, tau: 1e9*As*np.exp(-2*tau)'
    latex: 10^9 A_\mathrm{s} e^{-2\tau}
  age:
    latex: '{\rm{Age}}/\mathrm{Gyr}'
  rdrag:
    latex: r_\mathrm{drag}
sampler:
  mcmc:
    drag: true
    oversample_power: 0.4
    proposal_scale: 1.9
    covmat: auto
    Rminus1_stop: 0.01
    Rminus1_cl_stop: 0.2
resume: true
Here is the error, which appears after about 1 hour of computation:

[2 : mcmc] Learn + convergence test @ 8400 samples accepted.
[3 : mcmc] Learn + convergence test @ 8000 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:12:41 : 11818 steps taken, and 9588 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:12:42 : 11228 steps taken, and 8098 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:12:42 : 11439 steps taken, and 8505 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:12:43 : 10541 steps taken, and 5937 accepted.
[1 : mcmc] Learn + convergence test @ 9600 samples accepted.
[2 : mcmc] Learn + convergence test @ 8600 samples accepted.
[3 : mcmc] Learn + convergence test @ 8200 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:13:41 : 11983 steps taken, and 9730 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:13:42 : 11398 steps taken, and 8216 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:13:42 : 11608 steps taken, and 8657 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:13:43 : 10674 steps taken, and 5970 accepted.
[1 : mcmc] Learn + convergence test @ 9800 samples accepted.
[0 : mcmc] Learn + convergence test @ 6000 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:14:41 : 12206 steps taken, and 9864 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:14:42 : 11579 steps taken, and 8373 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:14:43 : 11762 steps taken, and 8775 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:14:43 : 10803 steps taken, and 6007 accepted.
[3 : mcmc] Learn + convergence test @ 8400 samples accepted.
[2 : mcmc] Learn + convergence test @ 8800 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:15:41 : 12388 steps taken, and 9995 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:15:42 : 11758 steps taken, and 8532 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:15:43 : 11917 steps taken, and 8891 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:15:43 : 10930 steps taken, and 6048 accepted.
[1 : mcmc] Learn + convergence test @ 10000 samples accepted.
[3 : mcmc] Learn + convergence test @ 8600 samples accepted.
[2 : mcmc] Learn + convergence test @ 9000 samples accepted.
[1 : mcmc] Progress @ 2022-05-18 03:16:42 : 12570 steps taken, and 10139 accepted.
[3 : mcmc] Progress @ 2022-05-18 03:16:42 : 11972 steps taken, and 8689 accepted.
[2 : mcmc] Progress @ 2022-05-18 03:16:43 : 12080 steps taken, and 9027 accepted.
[0 : mcmc] Progress @ 2022-05-18 03:16:43 : 11058 steps taken, and 6088 accepted.
[1 : mcmc] Learn + convergence test @ 10200 samples accepted.
[1 : mcmc] *ERROR* Waiting for too long for all chains to be ready. Maybe one of them is stuck or died unexpectedly?
[1 : mcmc] Aborting MPI due to error
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
What I don't understand is that the option "resume: true" at the end of sne.yaml is supposed to take over the computation if there is an error with MPI.

However, the whole computation stopped, and I see no way to tell Cobaya to continue the execution.

Does anyone have a clue about how to fix this behavior?

Regards

Jesus Torrado
Posts: 36
Joined: April 15 2013
Affiliation: RWTH Aachen

Re: Unexpected error with MPI Cobaya - "resume: true" option

Post by Jesus Torrado » May 18 2022

Hi Dournac,

"resume: true" will not restart a failed run automatically. It lets you run again with the same input file, so that the old run is continued instead of starting from scratch.

In your case, it seems that chain 0 was stuck. Can you try running it again and check whether it starts where it left off (the first few printed messages will tell you) and whether this time it goes on to converge?
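
Since the relaunch has to be done by hand, it can be scripted. Below is a minimal sketch, not part of Cobaya: `run_until_done` is a hypothetical helper that re-runs the same command until it exits with code 0. It assumes that `mpirun` returns a non-zero exit code after `MPI_Abort`, that the failure is transient, and that the input file contains "resume: true" so each relaunch continues the existing chains.

```python
import subprocess
import time


def run_until_done(cmd, max_attempts=5, wait_seconds=30, runner=subprocess.call):
    """Re-launch `cmd` until it exits with code 0 or attempts run out.

    Hypothetical helper (not a Cobaya function). `runner` is any callable
    that takes the command list and returns an exit code; it defaults to
    subprocess.call. Returns the number of attempts used, or None if every
    attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        code = runner(cmd)
        if code == 0:
            return attempt
        print(f"Attempt {attempt} exited with code {code}.")
        if attempt < max_attempts:
            # Short pause before relaunching the same command.
            time.sleep(wait_seconds)
    return None


# Example call, using the command from the first post:
# run_until_done(["mpirun", "-np", "4", "cobaya-run", "sne.yaml"])
```

A capped number of attempts is used on purpose: if a chain gets stuck for a systematic reason, relaunching forever would only hide the problem.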


Re: Unexpected error with MPI Cobaya - "resume: true" option

Post by Dournac Fabien » May 18 2022

Hi Jesus,

I launched the MPI Cobaya run again.

Here is the head of the log output:

[0 : output] Output to be read-from/written-into folder 'sne_chains', with prefix 'sne'
[0 : output] Found existing info files with the requested output prefix: 'sne_chains/sne'
[0 : output] Let's try to resume/load.
[0 : output] Found an old sample. Resuming.
[0 : CAMB] Importing *auto-installed* CAMB (but defaulting to *global*).
[2 : CAMB] Importing *auto-installed* CAMB (but defaulting to *global*).
[3 : CAMB] Importing *auto-installed* CAMB (but defaulting to *global*).
[1 : CAMB] Importing *auto-installed* CAMB (but defaulting to *global*).
[0 : camb] Initialized!
[2 : camb] Initialized!
[1 : camb] Initialized!
[3 : camb] Initialized!
[0 : mcmc] Resuming from previous sample!
[0 : samplecollection] Loaded 6088 sample points from 'sne_chains/sne.1.txt'
[0 : model] *WARNING* Oversampling would be trivial due to small speed difference or small `oversample_power`. Set to 2.
[0 : mcmc] *WARNING* Dragging disabled: speed ratios < 2.
[0 : mcmc] Covariance matrix from previous sample.
[2 : samplecollection] Loaded 9027 sample points from 'sne_chains/sne.3.txt'
[2 : model] *WARNING* Oversampling would be trivial due to small speed difference or small `oversample_power`. Set to 2.
[2 : mcmc] *WARNING* Dragging disabled: speed ratios < 2.
[3 : samplecollection] Loaded 8689 sample points from 'sne_chains/sne.4.txt'
[3 : model] *WARNING* Oversampling would be trivial due to small speed difference or small `oversample_power`. Set to 2.
[3 : mcmc] *WARNING* Dragging disabled: speed ratios < 2.
[1 : samplecollection] Loaded 10139 sample points from 'sne_chains/sne.2.txt'
[1 : model] *WARNING* Oversampling would be trivial due to small speed difference or small `oversample_power`. Set to 2.
[1 : mcmc] *WARNING* Dragging disabled: speed ratios < 2.
[0 : mcmc] Initial point: logA:3.21832, ns:0.8698616, theta_MC_100:0.9218277, ombh2:0.008974716, omch2:0.04612831, omk:-0.06771252, tau:0.5479178
[2 : mcmc] Initial point: logA:2.909386, ns:0.8037559, theta_MC_100:1.249986, ombh2:0.006290793, omch2:0.1507223, omk:-0.07780796, tau:0.0987209
[1 : mcmc] Initial point: logA:2.995976, ns:1.061477, theta_MC_100:1.089673, ombh2:0.09376229, omch2:0.1520718, omk:-0.08228081, tau:0.2483638
[3 : mcmc] Initial point: logA:3.029198, ns:1.151795, theta_MC_100:1.182734, ombh2:0.06345684, omch2:0.1615017, omk:-0.1143844, tau:0.05874039
[0 : mcmc] Sampling!
[3 : mcmc] Progress @ 2022-05-18 19:41:32 : 1 steps taken, and 8689 accepted.
[2 : mcmc] Progress @ 2022-05-18 19:41:32 : 1 steps taken, and 9027 accepted.
[0 : mcmc] Progress @ 2022-05-18 19:41:32 : 1 steps taken, and 6088 accepted.
[1 : mcmc] Progress @ 2022-05-18 19:41:32 : 1 steps taken, and 10139 accepted.
[3 : mcmc] Progress @ 2022-05-18 19:42:32 : 216 steps taken, and 8868 accepted.
[2 : mcmc] Progress @ 2022-05-18 19:42:32 : 185 steps taken, and 9177 accepted.
[0 : mcmc] Progress @ 2022-05-18 19:42:32 : 140 steps taken, and 6154 accepted.
[1 : mcmc] Progress @ 2022-05-18 19:42:33 : 169 steps taken, and 10282 accepted.
[0 : mcmc] Learn + convergence test @ 6160 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[2 : mcmc] Learn + convergence test @ 9240 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[3 : mcmc] Learn + convergence test @ 8960 samples accepted.
[3 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 10360 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
This seems to pick up where it stopped.

Hopefully it will continue until convergence.

Regards

PS: I have 64 cores/128 threads available. I set OMP_NUM_THREADS=64 and launched 4 MPI Cobaya processes: could this be the root of my issue?


Re: Unexpected error with MPI Cobaya - "resume: true" option

Post by Jesus Torrado » May 19 2022

Hi Dournac,
You wrote: "I have 64 cores/128 threads available. I took OMP_NUM_THREADS=64 and launched 4 MPI Cobaya processes: isn't it by chance the root of my issue?"

In general, you should not set OMP_NUM_THREADS larger than the number of hardware threads divided by the number of MPI processes, in your case 128 / 4 = 32. With OMP_NUM_THREADS=64 and 4 processes you are asking for 4 x 64 = 256 threads on a machine that has 128, so the processes may be choking each other.
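
The rule of thumb above can be written down as a short sketch. The function names are hypothetical and only illustrate the arithmetic; the thread and process counts are the ones quoted in this thread.

```python
import os


def omp_threads_per_process(hardware_threads, mpi_processes):
    """Upper bound for OMP_NUM_THREADS: hardware threads / MPI processes."""
    if hardware_threads < 1 or mpi_processes < 1:
        raise ValueError("both counts must be positive")
    # Integer division, but never fewer than one thread per process.
    return max(1, hardware_threads // mpi_processes)


def launch_env(hardware_threads, mpi_processes, base_env=None):
    """Copy of the environment with OMP_NUM_THREADS set to that bound."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(
        omp_threads_per_process(hardware_threads, mpi_processes)
    )
    return env


# 128 hardware threads shared by 4 MPI processes.
print(omp_threads_per_process(128, 4))
```

The dictionary returned by `launch_env` could be passed as the `env` argument of `subprocess.call` when starting `mpirun`, or the value can simply be exported in the shell before launching.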
