return to command-line prompt after cosmomc run

Use of Cobaya. camb, CLASS, cosmomc, compilers, etc.
Matthew Walker
Posts: 3
Joined: January 20 2011
Affiliation: Harvard-Smithsonian Center for Astrophysics

Post by Matthew Walker » January 21 2011

I run cosmomc as a generic sampler, often from a script that calls cosmomc (with openmpi) multiple times. A recurring problem is that when the MPI_Converge_Stop criterion is not met and the chain runs up to the limit set by the 'samples' parameter, the job occasionally hangs after printing 'Stopping as have [sample] samples' to the screen and does not return to the command-line prompt. When cosmomc is run from a script, this stalls the script (i.e., it never advances to the command after the cosmomc run). Is there a simple way to make the cosmomc run terminate fully once the maximum number of samples has been reached?
Thanks in advance,
Matt

Post by Matthew Walker » January 21 2011

I should have added that after the 'Stopping as have [sample] samples', the screen output includes reports of the number of slow proposals, says 'finished', and reports the running time as normal. It just doesn't return to a command-line prompt.

Antony Lewis
Posts: 1944
Joined: September 23 2004
Affiliation: University of Sussex

Post by Antony Lewis » January 21 2011

Probably only some of the chains have generated the limiting number of samples, so the rest are still running? You could try adding a call to DoAbort in DoStop in paramdef.F90. There may be a problem where the continuing chains hang if they try to communicate over MPI with a chain that has already stopped.
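
The suspected hang can be pictured with a toy model. The sketch below is only an analogy in plain Python threads, not cosmomc code and not MPI: a threading.Barrier stands in for a synchronisation step that needs every chain, and the function name simulate and its arguments are made up for illustration. One "chain" returns early, and the others wait for a partner that never arrives.

```python
import threading


def simulate(n_chains, early_stopper, wait_s=0.5):
    """Return the ids of the chains whose synchronisation step never completed."""
    # The barrier is a stand-in for a step that requires all chains to take part.
    barrier = threading.Barrier(n_chains)
    stuck = []
    lock = threading.Lock()

    def chain(rank):
        if rank == early_stopper:
            # This chain reached its sample limit and left without syncing.
            return
        try:
            # A real blocking call would wait here indefinitely; the timeout
            # exists only so that this demonstration terminates.
            barrier.wait(timeout=wait_s)
        except threading.BrokenBarrierError:
            with lock:
                stuck.append(rank)

    threads = [threading.Thread(target=chain, args=(r,)) for r in range(n_chains)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)
```

With three chains and chain 0 stopping early, simulate(3, 0) reports chains 1 and 2 as stuck; if no chain stops early (for example simulate(3, -1)), nothing is stuck. Aborting every process once the first chain finishes, as suggested above, avoids the wait entirely at the cost of cutting the slower chains short.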

Post by Matthew Walker » January 22 2011

Thanks, Antony. I put a 'call DoAbort(S)' just before the 'endif' in the DoStop subroutine in paramdef.F90. Apparently this now kills the process when the first chain reaches the maximum number of samples. For my purposes that should be fine as long as I choose a sufficiently large value for the maximum number of samples (the slower chains tend not to be very far behind). The screen output now includes the following complaints, which I am tempted to think are 'ok':


--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
openmpirun has exited due to process rank 0 with PID 5880 on
node dhcp152-185.cfa.harvard.edu exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by openmpirun (as reported here).
--------------------------------------------------------------------------
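
For scripted runs like the one described in this thread, a complementary guard that does not require editing paramdef.F90 is to put a time limit on each call from the driving script. The following is a minimal sketch assuming a POSIX system and a Python driver script; the helper name run_with_timeout is hypothetical, and whether the kill signal reaches every MPI rank depends on how the launcher manages its child processes, which is not verified here.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it had to be killed."""
    # start_new_session=True makes the launched process the leader of a new
    # process group, so the group can be signalled as a whole (POSIX only).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The job did not return in time: kill the whole process group.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process so no zombie is left behind
        return None


if __name__ == "__main__":
    # Placeholder command: a child Python process that exits immediately.
    code = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=10)
    print("exit code:", code)
```

In a real script the command list would be the MPI launch line for cosmomc (the launcher name, process count, and .ini file are site-specific and are not given in this thread), and the time limit would need to be comfortably longer than a normal run. A return value of None tells the script that the run was killed rather than finishing on its own, so it can decide whether to continue with the next call.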

Sheng Li
Posts: 57
Joined: May 26 2009
Affiliation: University of Sussex

Post by Sheng Li » February 02 2011

One possibility is that you have mixed in one or two files that come from a different version of camb or of the source.

Check the files that use MPI, and diff them against the original version of cosmomc.

From your posts, one such file could be paramdef.F90, but other files may also contain routines that use MPI; check them carefully.

If you are sure there are no differences, the problem is related to your computing environment (though I think this is very rare).
Otherwise, you need to work out which files call 'init' or other key routines.

Good luck.
