Is anyone else having problems with incomplete checkpoint files in CosmoMC MPI runs? Some are written fully, but others are incomplete; the file just ends prematurely. This happens both when the job is killed by the job scheduler because it hit a time limit and when CosmoMC terminates because the requested number of samples has been obtained. It also happens for quite small test runs, whether I write the checkpoints to home space or to fast, parallel scratch space.
When I examine the checkpoint files by hand, they seem to be written correctly right up to the point where they just end.
I'm running CosmoMC as a generic sampler with num_hard = 54.
I'm compiling using the Intel compiler in ICS 12.1.4 and OpenMPI 1.4. I've tried replacing the call to flush() in the FlushFile subroutine in utils.F90 with a call to the COMMITQQ function provided by ifort, which should operate like FLUSH but in blocking mode. No improvement, though.
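For reference, a minimal sketch of that kind of replacement. It assumes FlushFile takes the Fortran unit number as its only argument; the argument name aunit and the warning message are illustrative and not copied from utils.F90. It is ifort-specific, since COMMITQQ comes from Intel's IFPORT module.

```fortran
subroutine FlushFile(aunit)
    ! IFPORT is Intel's portability module; it provides COMMITQQ (ifort only)
    use IFPORT, only: COMMITQQ
    implicit none
    integer, intent(in) :: aunit   ! unit number of the open checkpoint file (assumed name)
    logical :: ok

    ! Ask the operating system to write any pending output for this unit to disk.
    ! COMMITQQ returns a logical status rather than being called as a subroutine.
    ok = COMMITQQ(aunit)
    if (.not. ok) write(*,*) 'FlushFile: COMMITQQ reported failure for unit ', aunit
end subroutine FlushFile
```

Checking the returned status at least shows whether the commit itself is reported as failing for any of the checkpoint units.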
Thanks in advance for any help!
Geraint
CosmoMC: incomplete checkpoint files
- Joined: March 15 2013
- Affiliation: University College London