CosmoCoffee Forum Index CosmoCoffee

 
[CosmoMC + CosmoChord + ModeCode] Error in testing test_planck.ini
 
Mason Ng



Joined: 17 May 2017
Posts: 18
Affiliation: The University of Auckland

Posted: August 11 2017

I managed to get CosmoMC + CosmoChord and CosmoMC + CosmoChord + ModeCode working (finally!) while just using these modules:

Code:
Currently Loaded Modules:
  1) GCCcore/6.3.0                 3) GCC/6.3.0
  2) binutils/2.27-GCCcore-6.3.0   4) OpenMPI/2.0.2-GCC-6.3.0


I have successfully run the test with test.ini for each of the two configurations. However, test_planck.ini yields errors. Before elaborating on that, I should point out that in the .bashrc file in my account on the cluster, I commented out the line

Code:
source /projects/uoa00518/plc-2.0/bin/clik_profile.sh


because the cluster folks found out that this was the cause of the error outlined in the last post of this thread: http://cosmocoffee.info/viewtopic.php?t=2867&sid=454222fe45dde99cc7bfb9502461a9f3. After that, I could successfully build the two configurations outlined above.

Now, when trying to get test_planck.ini to work, I get one of the following two errors (I am not sure of the cause, and I am also in contact with the cluster folks):

Code:
Program received signal SIGILL: Illegal instruction.

Backtrace for this error:

Program received signal SIGILL: Illegal instruction.

Backtrace for this error:
#0  0x2b0f909d091f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b0f909bccdc in ???
#4  0x403eac in ???
#0  0x2b900db7e91f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b900db6acdc in ???
#4  0x403eac in ???
srun: error: compute-a1-066: task 0: Illegal instruction (core dumped)
srun: error: compute-a1-068: task 1: Illegal instruction (core dumped)
#0  0x2b1ab620091f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b1ab61eccdc in ???
#4  0x403eac in ???
#0  0x2b48f48d191f in ???
#1  0x50155c in ???
#2  0x403e5c in ???
#3  0x2b48f48bdcdc in ???
#4  0x403eac in ???
srun: error: compute-gpu-a1-002: task 3: Illegal instruction (core dumped)
srun: error: compute-gpu-a1-001: task 2: Illegal instruction (core dumped)


or

Code:

Number of MPI processes:           4
 file_root:test
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 Random seeds:  3286, 12940 rand_inst:   1
 Random seeds:  4014, 12903 rand_inst:   4
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 NOTE: num_massive_neutrinos ignored, using specified hierarchy
 Random seeds:  2911, 12974 rand_inst:   3
 Random seeds:  3300, 13013 rand_inst:   2
 compile with CLIK to use clik - see Makefile
 MpiStop:            1
 compile with CLIK to use clik - see Makefile
 MpiStop:            2
 compile with CLIK to use clik - see Makefile
 MpiStop:            3
 compile with CLIK to use clik - see Makefile
 MpiStop:            0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 128.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: error: compute-e1-001: task 0: Exited with exit code 128
srun: error: compute-e1-013: task 3: Exited with exit code 128
srun: error: compute-e1-002: task 1: Exited with exit code 128
srun: error: compute-e1-007: task 2: Exited with exit code 128


In fact, when I tried what was outlined here: http://cosmocoffee.info/viewtopic.php?p=5731 and then logged out and back in, I got the SIGILL error.

Please advise, thanks!
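
Note for readers hitting the same SIGILL: the cause is not stated anywhere in this thread. One possibility worth ruling out (an assumption, not a diagnosis) is that the binary was built with -march=native, as in the compile line quoted later in this thread, on a host whose CPU supports instructions that some compute nodes lack. The sketch below compares two lists of CPU flags. It assumes Linux (/proc/cpuinfo), and cpu_flags and missing_flags are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: find CPU flags that the build host has but a run host lacks.
# Assumes Linux; helper names are hypothetical, not part of CosmoMC.

# Print this machine's CPU flags as one space-separated line.
cpu_flags() {
    grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2
}

# Print flags present in the first list but absent from the second.
# Usage: missing_flags "build host flags" "run host flags"
missing_flags() {
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;                 # flag is available on the run host
            *) printf '%s\n' "$flag" ;;     # flag is missing on the run host
        esac
    done
}

# Example with made-up flag lists: the build host has avx2, the run host does not.
missing_flags "sse4_2 avx avx2" "sse4_2 avx"
```

On a cluster, one could capture the output of cpu_flags on the build host and on a compute node (for example inside a job) and pass both strings to missing_flags. An empty result would suggest looking elsewhere for the cause.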
Mason Ng



Joined: 17 May 2017
Posts: 18
Affiliation: The University of Auckland

Posted: August 12 2017

I managed to resolve the SIGILL error.

I still get the second error,

Code:
compile with CLIK to use clik - see Makefile


however. Since the line "source /projects/uoa00518/plc-2.0/bin/clik_profile.sh" was commented out in ~/.bashrc, I ran it manually before submitting the job to the cluster. I still get the error above.

I have also tried putting that line in the job batch script, to no avail. The cluster folks did write to me that using "source" can lead to a "version conflict if the environment is not clean", if I understand it correctly. I was also advised to use "the command: 'module load' to load the environmental variables (and not the file: clik_profile.sh) and the command: 'module purge' to clean your environment". I find it odd that one would use 'module load' to load environment variables, since I have always used 'module load' to load cluster packages. The clik_profile.sh file is attached (with a .txt extension added).

clik_profile.txt

I searched around but could not find anything about loading environment variables the way one loads cluster packages. Please advise, thanks!
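
Note for readers: one way to judge the "version conflict" concern is to look at exactly which environment variables a profile script adds or changes. The sketch below sources a script in a subshell and prints only the new or changed entries, leaving the calling shell untouched. env_diff is a hypothetical helper name, and the demo profile, its variable name and its path are made up for illustration.

```shell
#!/bin/sh
# Sketch: show which environment variables a profile script adds or changes.
# env_diff is a hypothetical helper, not part of clik or CosmoMC.
env_diff() {
    profile=$1
    before=$(mktemp) && after=$(mktemp) || return 1
    env | sort > "$before"
    # Source the profile in a subshell so the calling shell stays clean.
    ( . "$profile" >/dev/null 2>&1; env | sort ) > "$after"
    # Lines marked ">" by diff exist only after sourcing the profile.
    diff "$before" "$after" | grep '^>' | sed 's/^> //'
    rm -f "$before" "$after"
}

# Example with a throwaway profile; the variable name and path are made up.
demo=$(mktemp)
echo 'export DEMO_CLIK_PATH=/tmp/demo-plc' > "$demo"
env_diff "$demo"
rm -f "$demo"
```

Running env_diff on the real clik_profile.sh after 'module purge' and 'module load' would show whether it overrides paths that the loaded modules have already set.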
Antony Lewis



Joined: 23 Sep 2004
Posts: 1257
Affiliation: University of Sussex

Posted: August 12 2017

You can check whether clik is set up correctly by running "echo $CLIK_PATH", which should show the installed path.
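
A slightly fuller version of that check, as a minimal sketch: it reports whether CLIK_PATH is set and points at a directory, and prints PLANCKLIKE, the other variable mentioned in this thread. check_clik_env is a hypothetical helper name. The message "compile with CLIK to use clik" suggests the binary was built without clik support, so the variables presumably need to be set in the shell that runs make, not only in the shell that runs the job. That reading is an assumption.

```shell
#!/bin/sh
# Sketch: report whether the clik-related variables are set in this shell.
# check_clik_env is a hypothetical helper, not part of clik or CosmoMC.
check_clik_env() {
    status=0
    if [ -z "${CLIK_PATH:-}" ]; then
        echo "CLIK_PATH is not set"
        status=1
    elif [ ! -d "$CLIK_PATH" ]; then
        echo "CLIK_PATH=$CLIK_PATH is not a directory"
        status=1
    else
        echo "CLIK_PATH=$CLIK_PATH"
    fi
    echo "PLANCKLIKE=${PLANCKLIKE:-<unset>}"
    return $status
}

check_clik_env || echo "clik environment looks incomplete in this shell"
```

Running the same function at the top of the batch script would show whether the job sees the same values as the login shell.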
Mason Ng



Joined: 17 May 2017
Posts: 18
Affiliation: The University of Auckland

Posted: August 14 2017

Antony Lewis wrote:
You can check whether clik is set up correctly by running "echo $CLIK_PATH", which should show the installed path.


echo $CLIK_PATH does give me the right installed path.

echo $PLANCKLIKE gives me "cliklike" as well (although I had to put it in the CosmoMC makefile and the source makefile, and also in my ~/.bashrc file to make sure).

I re-entered the cluster, after sourcing the clik_profile.sh file, to make sure that was the case. When I ran the job, I still got the same error (as in the first post):

Code:
compile with CLIK to use clik - see Makefile


I should also note that the CosmoMC I am using is from GitHub (used git clone from https://github.com/cmbant/CosmoMC.git).

Please advise, thanks.
Mason Ng



Joined: 17 May 2017
Posts: 18
Affiliation: The University of Auckland

Posted: August 14 2017

My supervisor took a look at the clik_profile.sh file (from above) and noted that it could be truncated: the lines directly involving the likelihood are kept, and the lines about the cluster modules are cut out. The truncated file is below:

clik_profile_truncated.rtf

Having done this, I did a quick check to see whether that file could be used with just CosmoMC. It turns out that, after building CosmoMC, I could still run 'mpirun -np 1 ./cosmomc test.ini' and 'mpirun -np 1 ./cosmomc test_planck.ini' successfully.

We tried this because my supervisor was suspicious of the suggestion that one would need to comment out the line "source paths/clik_profile.sh" in the ~/.bashrc file to build CosmoMC, even though commenting it out led to CosmoMC not being able to find the Planck likelihood (the original purpose of this post).

Now, when using this truncated clik_profile.sh file and then trying to build CosmoMC + CosmoChord, I get the following error:

Code:
mpif90 -cpp -O3 -ffast-math -ffree-line-length-none -fopenmp -fmax-errors=4 -march=native -DMPI -DCLIK -I../camb/ReleaseMPI -I/gpfs1m/projects/uoa00518/plc-2.0/include -I../polychord -JReleaseMPI -IReleaseMPI/  -I../polychord  -c cliklike.f90 -o ReleaseMPI/cliklike.o
f951: Fatal Error: Reading module ‘clik’ at line 1 column 2: Unexpected EOF
compilation terminated.
make[1]: *** [ReleaseMPI/cliklike.o] Error 1
make[1]: Leaving directory `/gpfs1m/projects/uoa00518/Work/CosmoMC/source'
make: *** [cosmomc] Error 2


It looked like this might be because the module was not found, so I had the ad hoc idea (admittedly one that risks corrupted-file problems) of copying the files from a working CosmoMC folder [specifically, from source/ReleaseMPI] to the source/ReleaseMPI folder for the CosmoMC + CosmoChord combination. These files are:

Code:
cliklike.o, cliklike.mod, CMB.o, DataLikelihoods.o, datalikelihoodlist.mod, calclike.o, calclike.mod, ImportanceSampling.o, importancesampling.mod, MCMC.o, minimize.o, samplecollector.mod, SampleCollector.o, GeneralSetup.o, generalsetup.mod, CalcLlike_Cosmology.o, calclike_cosmology.mod, CosmologyConfig.o, cosmologyconfig.mod, nestwrap.o, nestwrap.mod, driver.o


After driver.o, I get the following error:

cosmomc_error.rtf

I hope this can be resolved soon! Thanks.
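
Note for readers, on the f951 "Unexpected EOF" error when reading module 'clik' above: the error says the module file was found but could not be read. One possible cause (an assumption, not confirmed in this thread) is that the .mod file was written by a different gfortran version than the one now loaded. The sketch below prints the first line of a gfortran module file, which normally names the module format version. It assumes gfortran's convention of writing .mod files gzip-compressed in recent versions and as plain text in older ones. mod_header is a hypothetical helper name, and the demo file is a stand-in, not a real clik.mod.

```shell
#!/bin/sh
# Sketch: print the header line of a gfortran module (.mod) file.
# mod_header is a hypothetical helper, not part of gfortran or clik.
mod_header() {
    if gzip -t < "$1" 2>/dev/null; then
        # Compressed module file: decompress to stdout and keep the first line.
        gzip -dc < "$1" | head -n 1
    else
        # Plain-text module file: read the first line directly.
        head -n 1 "$1"
    fi
}

# Example on a stand-in file; the header text is illustrative only.
demo=$(mktemp)
printf "GFORTRAN module version 'N' created from demo.f90\n" | gzip -c > "$demo"
mod_header "$demo"
rm -f "$demo"
```

Pointing mod_header at the clik.mod under the plc-2.0 include directory, and comparing the result with a .mod file freshly produced by the loaded compiler, would show whether the two were written in the same module format.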