
MPI_Init error in running CosmoMC on NERSC-Cori

Posted: April 25 2019
by suraj kumar
I have successfully compiled CosmoMC on Cori at NERSC. However, running "mpirun -np 1 ./cosmomc test_planck.ini" crashes with the following output:

kumasura@cori09:~/CosmoMC-Nov2016> mpirun -np 1 ./cosmomc test_planck.ini
[Thu Apr 25 08:19:11 2019] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......: PMI2 init failed: 1
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
cosmomc 00000000006A7A84 for__signal_handl Unknown Unknown
libpthread-2.22.s 00002AAAB16D1C10 Unknown Unknown Unknown
libc-2.22.so 00002AAAB1912F67 gsignal Unknown Unknown
libc-2.22.so 00002AAAB191433A abort Unknown Unknown
libmpich_intel.so 00002AAAB0A66998 Unknown Unknown Unknown
libmpich_intel.so 00002AAAB09EFA32 MPIR_Handle_fatal Unknown Unknown
libmpich_intel.so 00002AAAB09EFB26 MPIR_Err_return_c Unknown Unknown
libmpich_intel.so 00002AAAB09746B4 MPI_Init Unknown Unknown
libmpich_intel.so 00002AAAB09C1A07 MPI_INIT Unknown Unknown
cosmomc 00000000005A2C33 Unknown Unknown Unknown
cosmomc 0000000000410FDE Unknown Unknown Unknown
libc-2.22.so 00002AAAB18FE725 __libc_start_main Unknown Unknown
cosmomc 0000000000410EE9 Unknown Unknown Unknown
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cori09 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The modules loaded in the environment are:

kumasura@cori09:~/CosmoMC-Nov2016> module list
Currently Loaded Modulefiles:
1) modules/3.2.10.6
2) nsg/1.2.0
3) intel/18.0.1.163
4) craype-network-aries
5) craype/2.5.15
6) cray-libsci/18.07.1
7) udreg/2.3.2-6.0.7.1_5.13__g5196236.ari
8) ugni/6.0.14.0-6.0.7.1_3.13__gea11d3d.ari
9) pmi/5.0.14
10) dmapp/7.1.1-6.0.7.1_5.45__g5a674e0.ari
11) gni-headers/5.0.12.0-6.0.7.1_3.11__g3b1768f.ari
12) xpmem/2.2.15-6.0.7.1_5.11__g7549d06.ari
13) job/2.2.3-6.0.7.1_5.43__g6c4e934.ari
14) dvs/2.7_2.2.118-6.0.7.1_10.1__g58b37a2
15) alps/6.6.43-6.0.7.1_5.45__ga796da32.ari
16) rca/2.2.18-6.0.7.1_5.47__g2aa4f39.ari
17) atp/2.1.3
18) PrgEnv-intel/6.0.4
19) craype-haswell
20) cray-mpich/7.7.3
21) altd/2.0
22) darshan/3.1.4
23) openmpi/3.1.3
kumasura@cori09:~/CosmoMC-Nov2016>
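
The list above contains both cray-mpich/7.7.3 and openmpi/3.1.3, and the stack trace shows MPI_Init being called from libmpich_intel.so, so a mismatch between the launcher and the MPI library the binary is linked against is one possibility worth ruling out. Below is a minimal sketch for picking the MPI-related entries out of `module list`-style text. The helper name `list_mpi_modules` and the set of name prefixes it matches are assumptions made for illustration, not part of CosmoMC or the NERSC tooling.

```shell
# Hypothetical helper: read `module list`-style text on stdin and print
# only the entries that look like MPI implementations.
list_mpi_modules() {
    # Put each whitespace-separated token on its own line, then keep the
    # tokens whose module name starts with a known MPI stack prefix.
    # The prefix list is an assumption; extend it as needed.
    tr -s '[:space:]' '\n' | grep -E '^(cray-mpich|openmpi|mpich|impi)/'
}

# Example with two lines taken from the listing above:
printf '20) cray-mpich/7.7.3\n23) openmpi/3.1.3\n' | list_mpi_modules
```

With the modules 3.2.x command shown in the listing, `module list` writes to stderr, so a live check would be `module list 2>&1 | list_mpi_modules`. More than one line of output means that more than one MPI stack is loaded at the same time.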

Please help.

Re: MPI_Init error in running CosmoMC on NERSC-Cori

Posted: April 26 2019
by suraj kumar
I then tried running the same job with srun, since a note on the NERSC web portal states that there is no 'mpirun' command (which is used by many MPI implementations) on Cori. But even with srun I get the following error:

Number of MPI processes: 2
file_root:test
Random seeds: 13219, 10873 rand_inst: 1
Random seeds: 13325, 10874 rand_inst: 2
Using clik with likelihood file
./data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik
TT from l=0 to l= 2508
Clik will run with the following nuisance parameters:
A_cib_217
cib_index
xi_sz_cib
A_sz
ps_A_100_100
ps_A_143_143
ps_A_143_217
ps_A_217_217
ksz_norm
gal545_A_100
gal545_A_143
gal545_A_143_217
gal545_A_217
calib_100T
calib_217T
A_planck
Using clik with likelihood file
./data/clik/low_l/bflike/lowl_SMW_70_dx11d_2014_10_03_v5c_Ap.clik
TT from l=0 to l= 2508
forrtl: severe (257): formatted I/O to unit open for unformatted transfers, unit 42, file /global/u2/k/kumasura/plc_2.0/low_l/bflike/lowl_SMW_70_dx11d_2014_10_03_v5c_Ap.clik/clik/lkl_0/external/.//params_bflike.ini
Image PC Routine Line Source
cosmomc 000000000066991E for__io_return Unknown Unknown
libifcoremt.so.5 00002AAAB453870B for_read_seq_nml Unknown Unknown
libclik.so 00002AAAB215535B bflike_smw_mp_ini Unknown Unknown
libclik.so 00002AAAB2113CD3 bflike_smw_extra Unknown Unknown
libclik.so 00002AAAB20F16E7 clik_bflike_smw_i Unknown Unknown
libclik.so 00002AAAB20BD710 clik_lklobject_in Unknown Unknown
libclik.so 00002AAAB20B4ED3 clik_init Unknown Unknown
libclik_f90.so 00002AAAAACD1754 fortran_clik_init Unknown Unknown
libclik_f90.so 00002AAAAACD54A4 clik_mp_clik_init Unknown Unknown
cosmomc 000000000050762A Unknown Unknown Unknown
cosmomc 0000000000504D8A Unknown Unknown Unknown
cosmomc 000000000055A21A Unknown Unknown Unknown
cosmomc 000000000058F04D Unknown Unknown Unknown
cosmomc 000000000059847A Unknown Unknown Unknown
cosmomc 0000000000410E5E Unknown Unknown Unknown
libc-2.22.so 00002AAAB18FE725 __libc_start_main Unknown Unknown
cosmomc 0000000000410D69 Unknown Unknown Unknown
forrtl: severe (257): formatted I/O to unit open for unformatted transfers, unit 42, file /global/u2/k/kumasura/plc_2.0/low_l/bflike/lowl_SMW_70_dx11d_2014_10_03_v5c_Ap.clik/clik/lkl_0/external/.//params_bflike.ini
Image PC Routine Line Source
cosmomc 000000000066991E for__io_return Unknown Unknown
libifcoremt.so.5 00002AAAB453870B for_read_seq_nml Unknown Unknown
libclik.so 00002AAAB215535B bflike_smw_mp_ini Unknown Unknown
libclik.so 00002AAAB2113CD3 bflike_smw_extra Unknown Unknown
libclik.so 00002AAAB20F16E7 clik_bflike_smw_i Unknown Unknown
libclik.so 00002AAAB20BD710 clik_lklobject_in Unknown Unknown
libclik.so 00002AAAB20B4ED3 clik_init Unknown Unknown
libclik_f90.so 00002AAAAACD1754 fortran_clik_init Unknown Unknown
libclik_f90.so 00002AAAAACD54A4 clik_mp_clik_init Unknown Unknown
cosmomc 000000000050762A Unknown Unknown Unknown
cosmomc 0000000000504D8A Unknown Unknown Unknown
cosmomc 000000000055A21A Unknown Unknown Unknown
cosmomc 000000000058F04D Unknown Unknown Unknown
cosmomc 000000000059847A Unknown Unknown Unknown
cosmomc 0000000000410E5E Unknown Unknown Unknown
libc-2.22.so 00002AAAB18FE725 __libc_start_main Unknown Unknown
cosmomc 0000000000410D69 Unknown Unknown Unknown
clik version 723c1a4b0580
smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.68135e-09)
clik version 723c1a4b0580
smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.68135e-09)
srun: error: nid01291: task 1: Exited with exit code 1
srun: Terminating job step 20792771.0
srun: error: nid01290: task 0: Exited with exit code 1

I tried it with CosmoMC 2015.

Please let me know whether this is a compilation error or whether I am missing something.