I work on a cluster running Rocky Linux that uses SLURM for job scheduling. My job is killed by a 'Segmentation fault' error before the chains converge.
Here is the end of the error message:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node node3 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
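The message above only says which rank died, not where. One thing I could add to the job script before the `mpirun` line to get more detail is sketched below. It relies on Python's standard `faulthandler` module, which can be switched on through the `PYTHONFAULTHANDLER` environment variable and prints a Python traceback when the interpreter receives a fatal signal such as SIGSEGV. Whether core dumps are allowed on this cluster is an assumption I have not verified.

```shell
#!/bin/bash
# Ask Python to print a traceback on fatal signals such as SIGSEGV
# (standard-library faulthandler, enabled through the environment).
export PYTHONFAULTHANDLER=1

# Try to allow core dumps; the cluster policy may refuse this (assumption).
ulimit -c unlimited 2>/dev/null || echo "could not raise the core dump limit"

# Show what is actually in effect before launching the job.
echo "PYTHONFAULTHANDLER=$PYTHONFAULTHANDLER, core limit: $(ulimit -c)"
```

The traceback goes to standard error, so with my script it should end up in the `.err` file. If the crash happens inside compiled code, the traceback would only show the Python frame that called into it. On a multi-node run the variable may also need to be forwarded with Open MPI's `mpirun -x PYTHONFAULTHANDLER`.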
I used the following .slurm script:
#!/bin/bash
#SBATCH --job-name=vitor-desi # Job name (student/professor name)
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=4 # Total number of MPI/OpenMP tasks
#SBATCH --partition thread-long # Queue (partition) to use
#SBATCH --output %x-slurm_job-%j.out
#SBATCH --error %x-slurm_job-%j.err
#SBATCH --time 72:00:00
#SBATCH --exclusive
## Command
module load openmpi-4.1.3-gcc-11.3.0-ecg62zg
module load python-3.9.12-gcc-11.3.0-dqhcoft
module load netlib-lapack-3.10.1-gcc-11.3.0-w4ll4ab
module load openblas-0.3.20-gcc-11.3.0-hehyjx5
export OMP_NUM_THREADS=8
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
time mpirun -n 4 -c 8 cobaya-run desi.yaml
echo 'done!'
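One thing I am unsure about in the script above is the `-c 8` flag. As far as I can tell from the Open MPI 4.x `mpirun` man page, `-c` is a synonym for `-n`/`-np` rather than a cores-per-process option, so the command may be starting 8 MPI processes with 8 OpenMP threads each instead of 4. A variant I could try, which takes both counts from SLURM's own variables so they cannot drift apart, is sketched here. The `--cpus-per-task=8` line is not in my original script, and the fallback values 4 and 8 are only there so the snippet also runs outside SLURM.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4          # number of MPI processes (one chain each)
#SBATCH --cpus-per-task=8   # assumption: 8 cores per process, not in the original script

# SLURM sets SLURM_NTASKS, and sets SLURM_CPUS_PER_TASK only when
# --cpus-per-task is requested; the defaults are fallbacks for dry runs.
ntasks="${SLURM_NTASKS:-4}"
threads="${SLURM_CPUS_PER_TASK:-8}"
export OMP_NUM_THREADS="$threads"

echo "launching $ntasks MPI processes x $OMP_NUM_THREADS OpenMP threads"

# Only launch when the launcher and the input file are really available.
if command -v mpirun >/dev/null 2>&1 && [ -f desi.yaml ]; then
    mpirun -n "$ntasks" cobaya-run desi.yaml
else
    echo "mpirun or desi.yaml not found; nothing launched"
fi
```

I have not confirmed that this has anything to do with the segmentation fault; it would only make the number of processes match what the `#SBATCH` header asks for.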
Here is part of the output:
[0 : output] Output to be read-from/written-into folder 'chains', with prefix 'desi'
[0 : classy] `classy` module loaded successfully from /home/valerio/Vitor/code/classy/python/build/lib.linux-x86_64-3.9
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
Loading ACT DR6 lensing likelihood v1.2...
[1 : bao.desi_2024_bao_all] Initialized.
[5 : bao.desi_2024_bao_all] Initialized.
[0 : bao.desi_2024_bao_all] Initialized.
[3 : bao.desi_2024_bao_all] Initialized.
[2 : bao.desi_2024_bao_all] Initialized.
[7 : bao.desi_2024_bao_all] Initialized.
[6 : bao.desi_2024_bao_all] Initialized.
[4 : bao.desi_2024_bao_all] Initialized.
[0 : planck_npipe_highl_camspec.ttteee] L-range for 143x143: 30 2000
[0 : planck_npipe_highl_camspec.ttteee] L-range for 217x217: 500 2500
[0 : planck_npipe_highl_camspec.ttteee] L-range for 143x217: 500 2500
[0 : planck_npipe_highl_camspec.ttteee] L-range for TE: 30 2000
[0 : planck_npipe_highl_camspec.ttteee] L-range for EE: 30 2000
[0 : planck_npipe_highl_camspec.ttteee] Number of data points: 9915
[0 : mcmc] Getting initial point... (this may take a few seconds)
[1 : mcmc] Getting initial point... (this may take a few seconds)
[3 : mcmc] Getting initial point... (this may take a few seconds)
[5 : mcmc] Getting initial point... (this may take a few seconds)
[7 : mcmc] Getting initial point... (this may take a few seconds)
[2 : mcmc] Getting initial point... (this may take a few seconds)
[6 : mcmc] Getting initial point... (this may take a few seconds)
[4 : mcmc] Getting initial point... (this may take a few seconds)
[7 : mcmc] Initial point: logA:3.050686, n_s:0.9669852, theta_s_100:1.040249, omega_b:0.02259909, omega_cdm:0.1188043, w0_fld:-0.9791969, wa_fld:-0.006614792, tau_reio:0.04497134, A_planck:0.9974186, amp_143:12.04573, amp_217:20.43255, amp_143x217:12.16838, n_143:0.5945278, n_217:1.045324, n_143x217:0.7852084, calTE:0.9922464, calEE:0.986712
[4 : mcmc] Initial point: logA:3.050663, n_s:0.9623175, theta_s_100:1.042048, omega_b:0.02255342, omega_cdm:0.1186688, w0_fld:-1.002143, wa_fld:0.02864212, tau_reio:0.05800379, A_planck:0.999966, amp_143:7.403503, amp_217:19.33834, amp_143x217:10.1292, n_143:1.147993, n_217:1.253417, n_143x217:1.021587, calTE:0.996946, calEE:1.006907
[6 : mcmc] Initial point: logA:3.050637, n_s:0.961375, theta_s_100:1.041336, omega_b:0.02247073, omega_cdm:0.1217384, w0_fld:-1.039361, wa_fld:-0.04458815, tau_reio:0.06254448, A_planck:0.9967626, amp_143:9.022771, amp_217:19.42763, amp_143x217:8.813448, n_143:0.7656103, n_217:0.90799, n_143x217:0.8804122, calTE:0.9880553, calEE:1.003728
[3 : mcmc] Initial point: logA:3.049529, n_s:0.9670282, theta_s_100:1.041858, omega_b:0.02231873, omega_cdm:0.1187046, w0_fld:-0.9863068, wa_fld:-0.01434019, tau_reio:0.05261551, A_planck:1.001804, amp_143:9.129428, amp_217:20.09941, amp_143x217:8.563477, n_143:0.6757008, n_217:1.300572, n_143x217:0.8111253, calTE:1.005832, calEE:0.9947194
[1 : mcmc] Initial point: logA:3.049855, n_s:0.9629607, theta_s_100:1.041569, omega_b:0.02241245, omega_cdm:0.1211135, w0_fld:-0.9763042, wa_fld:-0.05989669, tau_reio:0.05441758, A_planck:0.9980164, amp_143:10.96667, amp_217:18.88088, amp_143x217:8.934737, n_143:1.109386, n_217:1.009855, n_143x217:0.7696091, calTE:0.9995326, calEE:1.014423
[0 : mcmc] Initial point: logA:3.049156, n_s:0.9674645, theta_s_100:1.042069, omega_b:0.02239163, omega_cdm:0.1210594, w0_fld:-0.9701765, wa_fld:0.01064395, tau_reio:0.05049733, A_planck:1.000209, amp_143:10.22994, amp_217:19.26499, amp_143x217:9.580713, n_143:1.599194, n_217:1.07298, n_143x217:0.85116, calTE:0.9879141, calEE:0.9900177
[5 : mcmc] Initial point: logA:3.050713, n_s:0.9713535, theta_s_100:1.041762, omega_b:0.02232563, omega_cdm:0.1202618, w0_fld:-0.9822364, wa_fld:-0.01999689, tau_reio:0.05711237, A_planck:1.001212, amp_143:9.261705, amp_217:21.0451, amp_143x217:9.917676, n_143:1.040411, n_217:1.196917, n_143x217:0.8816047, calTE:0.9934727, calEE:0.9911811
[2 : mcmc] Initial point: logA:3.048984, n_s:0.9650286, theta_s_100:1.041459, omega_b:0.0223428, omega_cdm:0.1226118, w0_fld:-0.965806, wa_fld:-0.02889139, tau_reio:0.06393853, A_planck:0.9987781, amp_143:9.666255, amp_217:18.88405, amp_143x217:11.01931, n_143:0.8107737, n_217:0.9504953, n_143x217:0.820152, calTE:1.011386, calEE:0.9984422
[0 : model] Measuring speeds... (this may take a few seconds)
[0 : model] Setting measured speeds (per sec): {act_dr6_lenslike.ACTDR6LensLike: 8.49, sn.desy5: 114.0, bao.desi_2024_bao_all: 832.0, planck_2018_lowl.TT: 1280.0, planck_2018_lowl.EE: 3470.0, planck_NPIPE_highl_CamSpec.TTTEEE: 20.6, planckpr4lensing: 32.4, classy: 0.0237}
[0 : mcmc] Dragging with number of interpolating steps:
[0 : mcmc] * 1 : [['logA', 'n_s', 'theta_s_100', 'omega_b', 'omega_cdm', 'w0_fld', 'wa_fld', 'tau_reio']]
[0 : mcmc] * 14 : [['A_planck'], ['amp_143', 'amp_217', 'amp_143x217', 'n_143', 'n_217', 'n_143x217', 'calTE', 'calEE']]
[0 : autoselect_covmat] No cached covmat database present, not usable or not up-to-date. Will be re-created and cached.
[0 : autoselect_covmat] *WARNING* No covariance matrix found including at least one of the given parameters
[0 : mcmc] Could not automatically find a good covmat. Will generate from parameter info (proposal and prior).
[0 : mcmc] Covariance matrix not present. We will start learning the covariance of the proposal earlier: R-1 = 30 (would be 2 if all params loaded).
[0 : mcmc] Sampling!
[7 : mcmc] Progress @ 2024-06-10 06:27:07 : 1 steps taken, and 0 accepted.
[4 : mcmc] Progress @ 2024-06-10 06:27:07 : 1 steps taken, and 0 accepted.
[6 : mcmc] Progress @ 2024-06-10 06:27:08 : 1 steps taken, and 0 accepted.
[3 : mcmc] Progress @ 2024-06-10 06:27:11 : 1 steps taken, and 0 accepted.
[1 : mcmc] Progress @ 2024-06-10 06:27:11 : 1 steps taken, and 0 accepted.
[5 : mcmc] Progress @ 2024-06-10 06:27:11 : 1 steps taken, and 0 accepted.
[2 : mcmc] Progress @ 2024-06-10 06:27:11 : 1 steps taken, and 0 accepted.
[0 : mcmc] Progress @ 2024-06-10 06:27:11 : 1 steps taken, and 0 accepted.
...
[0 : mcmc] Progress @ 2024-06-12 11:36:44 : 4737 steps taken, and 1617 accepted.
[4 : mcmc] Progress @ 2024-06-12 11:36:52 : 4730 steps taken, and 1638 accepted.
[6 : mcmc] Progress @ 2024-06-12 11:37:28 : 4722 steps taken, and 1638 accepted.
[1 : mcmc] Progress @ 2024-06-12 11:37:30 : 4709 steps taken, and 1538 accepted.
[2 : mcmc] Progress @ 2024-06-12 11:37:37 : 4728 steps taken, and 1624 accepted.
[3 : mcmc] Progress @ 2024-06-12 11:37:43 : 4727 steps taken, and 1630 accepted.
[7 : mcmc] Progress @ 2024-06-12 11:37:47 : 4748 steps taken, and 1624 accepted.
[5 : mcmc] Progress @ 2024-06-12 11:37:50 : 4757 steps taken, and 1596 accepted.
[0 : mcmc] Progress @ 2024-06-12 11:37:58 : 4739 steps taken, and 1619 accepted.
[4 : mcmc] Progress @ 2024-06-12 11:38:05 : 4732 steps taken, and 1639 accepted.
[6 : mcmc] Progress @ 2024-06-12 11:38:42 : 4724 steps taken, and 1640 accepted.
[1 : mcmc] Progress @ 2024-06-12 11:38:43 : 4711 steps taken, and 1539 accepted.
[2 : mcmc] Progress @ 2024-06-12 11:38:50 : 4730 steps taken, and 1624 accepted.
[3 : mcmc] Progress @ 2024-06-12 11:38:58 : 4729 steps taken, and 1630 accepted.
[7 : mcmc] Progress @ 2024-06-12 11:38:58 : 4750 steps taken, and 1624 accepted.
[5 : mcmc] Progress @ 2024-06-12 11:39:04 : 4759 steps taken, and 1597 accepted.
[0 : mcmc] Progress @ 2024-06-12 11:39:11 : 4741 steps taken, and 1619 accepted.
[4 : mcmc] Progress @ 2024-06-12 11:39:18 : 4734 steps taken, and 1639 accepted.
done!
I suspect the problem is in the script: I think the MPI parallelization is not working as it is supposed to. I have already tried many ways of writing the script, and all of them ended the same way. Does anyone have any idea how to solve this?
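To check how many MPI processes actually ran, the rank labels at the start of each log line can be counted. In the output above they run from 0 to 7, although the script requests `--ntasks=4`. Below is a minimal sketch, assuming every rank-tagged line starts with `[N : ` as in the log; `count_ranks` is a hypothetical helper name, and the sample lines are copied from the output above.

```shell
#!/bin/bash
# Count the distinct MPI ranks that appear in a cobaya log file.
# Assumes rank-tagged lines look like "[4 : mcmc] ..." as in the output above.
count_ranks() {
    # Print only the leading rank number of matching lines,
    # then sort numerically, drop duplicates, and count.
    sed -n 's/^\[\([0-9][0-9]*\) : .*/\1/p' "$1" | sort -un | wc -l | tr -d ' '
}

# Small self-contained sample so the helper can be tried without the real log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
[0 : mcmc] Sampling!
[7 : mcmc] Progress @ 2024-06-10 06:27:07 : 1 steps taken, and 0 accepted.
[4 : mcmc] Progress @ 2024-06-10 06:27:07 : 1 steps taken, and 0 accepted.
Loading ACT DR6 lensing likelihood v1.2...
[4 : mcmc] Progress @ 2024-06-12 11:39:18 : 4734 steps taken, and 1639 accepted.
EOF

echo "distinct MPI ranks: $(count_ranks "$sample_log")"
rm -f "$sample_log"
```

On the real job I would point it at the SLURM output file, which the `--output` pattern in my script names `vitor-desi-slurm_job-<jobid>.out`.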
Thank you in advance.
Vitor Petri Silva