MPI_Init error when running CosmoMC on NERSC-Cori
Posted: April 25 2019
I have successfully compiled CosmoMC on cori@nersc. However, running "mpirun -np 1 ./cosmomc test_planck.ini" crashes with the following output:
kumasura@cori09:~/CosmoMC-Nov2016> mpirun -np 1 ./cosmomc test_planck.ini
[Thu Apr 25 08:19:11 2019] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......: PMI2 init failed: 1
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
cosmomc 00000000006A7A84 for__signal_handl Unknown Unknown
libpthread-2.22.s 00002AAAB16D1C10 Unknown Unknown Unknown
libc-2.22.so 00002AAAB1912F67 gsignal Unknown Unknown
libc-2.22.so 00002AAAB191433A abort Unknown Unknown
libmpich_intel.so 00002AAAB0A66998 Unknown Unknown Unknown
libmpich_intel.so 00002AAAB09EFA32 MPIR_Handle_fatal Unknown Unknown
libmpich_intel.so 00002AAAB09EFB26 MPIR_Err_return_c Unknown Unknown
libmpich_intel.so 00002AAAB09746B4 MPI_Init Unknown Unknown
libmpich_intel.so 00002AAAB09C1A07 MPI_INIT Unknown Unknown
cosmomc 00000000005A2C33 Unknown Unknown Unknown
cosmomc 0000000000410FDE Unknown Unknown Unknown
libc-2.22.so 00002AAAB18FE725 __libc_start_main Unknown Unknown
cosmomc 0000000000410EE9 Unknown Unknown Unknown
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cori09 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
The modules loaded in the environment are:
kumasura@cori09:~/CosmoMC-Nov2016> module list
Currently Loaded Modulefiles:
1) modules/3.2.10.6
2) nsg/1.2.0
3) intel/18.0.1.163
4) craype-network-aries
5) craype/2.5.15
6) cray-libsci/18.07.1
7) udreg/2.3.2-6.0.7.1_5.13__g5196236.ari
8) ugni/6.0.14.0-6.0.7.1_3.13__gea11d3d.ari
9) pmi/5.0.14
10) dmapp/7.1.1-6.0.7.1_5.45__g5a674e0.ari
11) gni-headers/5.0.12.0-6.0.7.1_3.11__g3b1768f.ari
12) xpmem/2.2.15-6.0.7.1_5.11__g7549d06.ari
13) job/2.2.3-6.0.7.1_5.43__g6c4e934.ari
14) dvs/2.7_2.2.118-6.0.7.1_10.1__g58b37a2
15) alps/6.6.43-6.0.7.1_5.45__ga796da32.ari
16) rca/2.2.18-6.0.7.1_5.47__g2aa4f39.ari
17) atp/2.1.3
18) PrgEnv-intel/6.0.4
19) craype-haswell
20) cray-mpich/7.7.3
21) altd/2.0
22) darshan/3.1.4
23) openmpi/3.1.3
kumasura@cori09:~/CosmoMC-Nov2016>
Please help.