[CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Use of Cobaya, CAMB, CLASS, CosmoMC, compilers, etc.
Mason Ng
Posts: 26
Joined: May 17 2017
Affiliation: The University of Auckland

[CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Post by Mason Ng » August 11 2017

I managed to get {CosmoMC + CosmoChord} and {CosmoMC + CosmoChord + ModeCode} working (finally!) while just using these modules:

Code:

Currently Loaded Modules:
  1) GCCcore/6.3.0                 3) GCC/6.3.0
  2) binutils/2.27-GCCcore-6.3.0   4) OpenMPI/2.0.2-GCC-6.3.0
I have successfully carried out testing on test.ini for each of the two configurations. However, test_planck.ini yields errors. Before elaborating on that, I should point out that in the .bashrc file of my account on the cluster, I commented out the line

Code:

source /projects/uoa00518/plc-2.0/bin/clik_profile.sh


because the cluster folks found out that this was the cause of the error outlined in the last post of this thread: http://cosmocoffee.info/viewtopic.php?t ... 502461a9f3. After that, I could successfully build the two configurations outlined above.

Now, when trying to get test_planck.ini to work, I get one of the following two errors (I am not sure of the cause; I am also in contact with the cluster folks):

Code:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0  0x2b0f909d091f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b0f909bccdc in ???
#4  0x403eac in ???
#0  0x2b900db7e91f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b900db6acdc in ???
#4  0x403eac in ???
srun: error: compute-a1-066: task 0: Illegal instruction (core dumped)
srun: error: compute-a1-068: task 1: Illegal instruction (core dumped)
#0  0x2b1ab620091f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b1ab61eccdc in ???
#4  0x403eac in ???
#0  0x2b48f48d191f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b48f48bdcdc in ???
#4  0x403eac in ???
srun: error: compute-gpu-a1-002: task 3: Illegal instruction (core dumped)
srun: error: compute-gpu-a1-001: task 2: Illegal instruction (core dumped)
or

Code:

Number of MPI processes:           4
 file_root:test
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 Random seeds:  3286, 12940 rand_inst:   1
 Random seeds:  4014, 12903 rand_inst:   4
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 Random seeds:  2911, 12974 rand_inst:   3
 Random seeds:  3300, 13013 rand_inst:   2
 compile with CLIK to use clik - see Makefile
 MpiStop:            1
 compile with CLIK to use clik - see Makefile
 MpiStop:            2
 compile with CLIK to use clik - see Makefile
 MpiStop:            3
 compile with CLIK to use clik - see Makefile
 MpiStop:            0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: compute-e1-001: task 0: Exited with exit code 128
srun: error: compute-e1-013: task 3: Exited with exit code 128
srun: error: compute-e1-002: task 1: Exited with exit code 128
srun: error: compute-e1-007: task 2: Exited with exit code 128
In fact, when I tried doing what was outlined here: http://cosmocoffee.info/viewtopic.php?p=5731 and also logged back out and in, I got the SIGILL error.

Please advise, thanks!

Mason Ng
Posts: 26
Joined: May 17 2017
Affiliation: The University of Auckland

[CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Post by Mason Ng » August 12 2017

I managed to resolve the SIGILL error. (For what it's worth, an illegal-instruction crash on a cluster is often the result of building with -march=native on one CPU type and running on nodes of a different generation, which would fit the mix of node names in the logs above.)

I still get the second error,

Code:

compile with CLIK to use clik - see Makefile
however. Since the line "source /projects/uoa00518/plc-2.0/bin/clik_profile.sh" was commented out in ~/.bashrc, I executed it manually before submitting the job to the cluster. I still get the error above.

I have also tried putting that line in the job batch script, to no avail. The cluster folks did write to me that using "source" can lead to a "version conflict if the environment is not clean", if I understand them correctly. They also advised me to use 'module load' to load the environment variables (rather than sourcing the clik_profile.sh file) and 'module purge' to clean my environment; a sketch of what that might look like is below the attachment. I find it odd that one has to use 'module load' to load environment variables, since I have always used 'module load' to load cluster packages. The clik_profile.sh file is attached (with a .txt extension added).

clik_profile.txt
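For illustration, here is a minimal sketch of a batch script along the lines the cluster folks suggested: start from a clean environment, load only the modules the build used, and set the clik environment inside the job rather than in ~/.bashrc (the module names are the ones from my working build; the paths are my installation):

Code:

#!/bin/bash
#SBATCH --ntasks=4

# Start from a clean environment, then load only what the build used
module purge
module load GCC/6.3.0 OpenMPI/2.0.2-GCC-6.3.0

# Set the clik environment for this job only (instead of via ~/.bashrc)
source /projects/uoa00518/plc-2.0/bin/clik_profile.sh

srun ./cosmomc test_planck.ini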

I searched around and could not find anything about loading environment variables the way one loads cluster packages. Please advise, thanks!

Antony Lewis
Posts: 1936
Joined: September 23 2004
Affiliation: University of Sussex

Re: [CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Post by Antony Lewis » August 12 2017

You can check whether clik is set up correctly by using "echo $CLIK_PATH", which should show the installed path.
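For example, something like this (the exact path depends on where plc-2.0 is installed):

Code:

$ echo $CLIK_PATH
/projects/uoa00518/plc-2.0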

Mason Ng
Posts: 26
Joined: May 17 2017
Affiliation: The University of Auckland

Re: [CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Post by Mason Ng » August 14 2017

Antony Lewis wrote: You can check whether clik is set up correctly by using "echo $CLIK_PATH", which should show the installed path.
echo $CLIK_PATH does give me the right installed path.

echo $PLANCKLIKE gives me "cliklike" as well (although, to make sure, I had to set it in the main CosmoMC Makefile and the source Makefile, and also in my ~/.bashrc file).
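For reference, a sketch of the settings involved, assuming the standard CosmoMC Makefile variables (the clik path here is my installation):

Code:

# In ~/.bashrc or the job script, set before running make:
export PLANCKLIKE=cliklike
export CLIK_PATH=/projects/uoa00518/plc-2.0

Since the "compile with CLIK to use clik" message comes from a binary that was built without CLIK support, these need to be in effect when make runs (followed by a full rebuild), not just when the job is launched.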

I logged out of and back into the cluster after sourcing the clik_profile.sh file, to make sure that was the case. When I ran the job, I still got the same error as in the first post:

Code:

compile with CLIK to use clik - see Makefile
I should also note that the CosmoMC I am using is from GitHub (obtained via git clone from https://github.com/cmbant/CosmoMC.git).

Please advise, thanks.

Mason Ng
Posts: 26
Joined: May 17 2017
Affiliation: The University of Auckland

[CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Post by Mason Ng » August 14 2017

My supervisor took a look at the clik_profile.sh file (from above) and noted that it could be truncated so that the lines directly involving the likelihood are preserved and the lines about the cluster modules are cut out. The truncated file is below:

clik_profile_truncated.rtf

Once this was done, I did a quick check to see whether that file could be used with plain CosmoMC. It turns out that after building CosmoMC, I could still run 'mpirun -np 1 ./cosmomc test.ini' and 'mpirun -np 1 ./cosmomc test_planck.ini' successfully.

The reason this all happened is that my supervisor was suspicious of the suggestion that one would need to comment out the line "source paths/clik_profile.sh" in ~/.bashrc to build CosmoMC, even though commenting it out led to CosmoMC not being able to find the Planck likelihood (the original subject of this post).

Now, when using this truncated clik_profile.sh file and then trying to build CosmoMC + CosmoChord, I get the following error:

Code:

mpif90 -cpp -O3 -ffast-math -ffree-line-length-none -fopenmp -fmax-errors=4 -march=native -DMPI -DCLIK -I../camb/ReleaseMPI -I/gpfs1m/projects/uoa00518/plc-2.0/include -I../polychord -JReleaseMPI -IReleaseMPI/  -I../polychord  -c cliklike.f90 -o ReleaseMPI/cliklike.o
f951: Fatal Error: Reading module ‘clik’ at line 1 column 2: Unexpected EOF
compilation terminated.
make[1]: *** [ReleaseMPI/cliklike.o] Error 1
make[1]: Leaving directory `/gpfs1m/projects/uoa00518/Work/CosmoMC/source'
make: *** [cosmomc] Error 2
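(As an aside: an "Unexpected EOF" while reading a .mod file usually means the module file is empty or truncated, or otherwise unreadable by this gfortran. The 'clik' module here is the one shipped with the Planck likelihood install, picked up via the -I...plc-2.0/include flag above. A sketch of a check that might be safer than copying files by hand, assuming that layout:)

Code:

# Is the Fortran module file present and non-empty where the -I flag points?
ls -l /gpfs1m/projects/uoa00518/plc-2.0/include/clik.mod

# If it is empty or truncated, rebuild the Planck likelihood with the same
# compilers as CosmoMC, then rebuild CosmoMC from scratch:
cd /gpfs1m/projects/uoa00518/Work/CosmoMC
make clean && make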
It looked like it might have been because the clik module file was not found (or was corrupt), so I had the ad hoc (and probably fragile, given the risk of corrupt or mismatched files) idea of copying the files from a working CosmoMC folder [specifically, from source/ReleaseMPI] to the source/ReleaseMPI folder for the CosmoMC + CosmoChord combination. These files are:

Code:

cliklike.o, cliklike.mod, CMB.o, DataLikelihoods.o, datalikelihoodlist.mod, calclike.o, calclike.mod, ImportanceSampling.o, importancesampling.mod, MCMC.o, minimize.o, samplecollector.mod, SampleCollector.o, GeneralSetup.o, generalsetup.mod, CalcLike_Cosmology.o, calclike_cosmology.mod, CosmologyConfig.o, cosmologyconfig.mod, nestwrap.o, nestwrap.mod, driver.o


After copying driver.o, I get the following error:

cosmomc_error.rtf

I hope this can be resolved soon! Thanks.

Mason Ng
Posts: 26
Joined: May 17 2017
Affiliation: The University of Auckland

[CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini

Post by Mason Ng » August 31 2017

Managed to get this sorted.

It's just about using the right modules on the cluster - I mixed this up a lot in the beginning. Apologies!
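For anyone who hits the same thing, here is a sketch of the kind of environment setup that worked for me, using the module names from my first post (the clik path is my installation; exact module names will vary by cluster):

Code:

# Clean environment, then load only the toolchain the build used
module purge
module load GCC/6.3.0 OpenMPI/2.0.2-GCC-6.3.0

# clik environment and Planck likelihood selection for CosmoMC
source /projects/uoa00518/plc-2.0/bin/clik_profile.sh
export PLANCKLIKE=cliklike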
