checkpoint files not getting created

shadab alam
Posts: 14
Joined: November 06 2013
Affiliation: CMU

Post by shadab alam » February 21 2014

Hello everyone,

[I am sorry if this is a repetition, but I couldn't find anything relevant.]

I have installed the latest version of CosmoMC and it works perfectly.

Now I have written a module which also compiles and runs without any error, and everything makes sense. The problem is that when I include my module, cosmomc doesn't produce *.chk files, so I cannot restart the run when the job stops due to the system run-time limit. I ran it for 7 days, which completed 140,000 steps, and there was still no *.chk file in the chain directory. I have checked the input parameters and everything looks right.

The *.chk files do get created if I don't include my module in the run. Do I need to write something in the new module for *.chk?
Any idea about a possible reason, or where to look for the issue, would be highly appreciated.

Thanks,
Shadab

Sheng Li
Posts: 57
Joined: May 26 2009
Affiliation: University of Sussex

Post by Sheng Li » February 21 2014

If you did not modify the option below,

#If true, generate checkpoint files and terminated runs can be restarted using exactly the same command
#and chains continued from where they stopped
#With checkpoint=T note you must delete all chains/file_root.* files if you want new chains with an old file_root
checkpoint = T

which comes from the file batch1/common_batch1.ini under your CosmoMC folder, then the question is: how did you include your module?

Post by shadab alam » February 22 2014

Hi Sheng Li,

Thanks for taking the time to reply.

I do have checkpoint=T in my input file, and indep_param=1 (to also save the data file for later importance sampling).

I have added the module as a data likelihood. It works exactly like the supernovae module. In fact, I used the supernovae data likelihood module as the template. I have 3 extra parameters which are entirely handled in the data likelihood, similar to the supernovae file.

I don't have any idea where to start. When I just remove the use of my module without changing anything else, I do see the *chk file. But there is nothing in the likelihood module which has to do with the *chk file, or at least I couldn't find anything related.

Any suggestions will be appreciated.
I am not sure if I am providing enough information for people to make suggestions. Please tell me if more information is needed.

Thanks,
Shadab

Post by Sheng Li » February 22 2014

From your description, as far as I can see at the moment, you should check whether your self-defined likelihood (if it is indeed self-defined) is correct.

btw: what did you mean exactly by saying "3 extra parameters which are entirely handled in the data likelihood"? Have any of these three params been recognised in the files param_def, param_CMB and CMB_Cls_simple?

For example:

Code:

params_CMB.f90:9:    !Also a background-only parameterization, e.g. for use with just supernoave etc
params_CMB.f90:270:    !!! Simple parameterization for background data, e.g. Supernovae only (no thermal history)

If not, then saying "entirely" may not be correct, and you need to make those three params recognised in either or both of the above files.

Post by Sheng Li » February 25 2014

Also, you need to be aware of the following settings in the source folder:
paramdef.F90
checkpoint_freq = 100
....
....
if (checkpoint .and. all_burn .and. (.not. done_check .or. mod(sample_num+1,checkpoint_freq)==0)) then
....
These tell you how often, and under what condition, your program will generate a .chk file.
So you can check whether your program has evaluated and recorded 99 effective samples (accepted samples, I suppose) in each of the txt files under the "file_root" folder.
The logic is clear, so I won't go into more detail here.

However, one other thing is vital for your program. Before considering the settings listed above, you *must* make sure that your likelihood or module is set up correctly, so that the following subroutine is triggered:
paramdef.F90
AddMPIParams(P,like,checkpoint_start)
if you have changed the code to use your own likelihood function/file.


NOTE: all of the above applies only if you have not changed the default key-value pairs for other parameters in paramdef.F90 or other source files.
If you did change them, make sure your changes are correct.
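
To make the quoted condition concrete, here is a toy Python model of it (not CosmoMC code). It only mirrors the `if` test shown above; the names `should_checkpoint`, `checkpoint_steps` and the `all_burn_from` argument are made up for this sketch, and it assumes `done_check` is set the first time the test passes.

```python
def should_checkpoint(checkpoint, all_burn, done_check, sample_num, checkpoint_freq=100):
    """Python transcription of the quoted test:
    checkpoint .and. all_burn .and. (.not. done_check .or. mod(sample_num+1, checkpoint_freq)==0)
    """
    return checkpoint and all_burn and (
        (not done_check) or (sample_num + 1) % checkpoint_freq == 0
    )


def checkpoint_steps(n_samples, all_burn_from, checkpoint=True, checkpoint_freq=100):
    """Return the sample numbers at which a checkpoint would be written.

    all_burn_from is the first sample number at which all_burn is true
    (None means all_burn never becomes true). Hypothetical helper.
    """
    done_check = False
    written = []
    for sample_num in range(n_samples):
        all_burn = all_burn_from is not None and sample_num >= all_burn_from
        if should_checkpoint(checkpoint, all_burn, done_check, sample_num, checkpoint_freq):
            done_check = True  # assumed: set on the first successful test
            written.append(sample_num)
    return written


# all_burn becomes true at sample 120: one write immediately, then every 100th sample.
print(checkpoint_steps(350, all_burn_from=120))      # [120, 199, 299]
# all_burn never becomes true: no checkpoint, however long the run is.
print(checkpoint_steps(140000, all_burn_from=None))  # []
```

In this model the frequency only matters once `all_burn` is true, so a long run with no .chk file points at `all_burn` rather than at `checkpoint_freq`.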

Post by shadab alam » February 26 2014

Hi Sheng Li,

I really appreciate your help.
I have been trying to understand and do all the checks you mentioned.

I still don’t understand the reason for the problem. I am sure that when I ran cosmomc with my module it ran for more than 100 steps: it ran for 140,000 steps. I printed the likelihood after every step. For each processor it was overall decreasing or stationary. I have also used getdist to make some plots to see whether the MCMC is converging, and it looked as if it was doing the right thing. Here is the likelihood for my parameters after the 7-day run.
[Image: likelihood for my parameters after the 7-day run]

Now, I think it will be good to describe in detail what I am trying to do, in order to clarify things.

Basically, I have created a module to fit the SDSS correlation function with 3 likelihood parameters (two bias parameters and one derivative of the logarithmic growth rate). These parameters just determine the perturbation theory correlation function which is fitted to the SDSS correlation function. The module is defined analogously to the supernovae module. The parameters are treated like alpha_SNLS and beta_SNLS in supernovae_SNLS.f90. They are defined as DataParams. Their ranges are defined in their own data input file (rsd.ini) and used for optimization via the Datalikelihood class.
Here is my data input parameter file.

Code:

use_RSD = T
use_RSDDR11 = T

param[F1_RSD]   = 0.7 0.2 1.1 0.01  0.01
param[F2_RSD]   = 0.4 0   1   0.01  0.01
param[fvel_RSD] = 0.4 0.2 1.2 0.01 0.01

which is equivalent to the supernovae input file (batch1/SNLS.ini):

Code:

use_SN = T
use_SNLS = T

param[alpha_SNLS]=1.442 0.6 2.6 0.11 0.11
param[beta_SNLS]=3.262 0.9 4.6 0.11 0.11

Sheng Li wrote: btw: what did you mean exactly by saying "3 extra parameters which are entirely handled in the data likelihood"? Have any of these three params been recognised in the files param_def, param_CMB and CMB_Cls_simple?

I meant that I never had to declare them anywhere other than the input file (rsd.ini) and the likelihood subroutine (see below). Here is how they are invoked:

Code:

    !--------------------------
    ! Routine for Log likelihood
    !--------------------------
    FUNCTION rsd_LnLike(like, CMB, Theory, DataParams)
    Class(RSDLikelihood) :: like
    Class (CMBParams) CMB
    Class(TheoryPredictions) Theory
    .....
    .....


    real(mcp) DataParams(:)
    REAL(mcp) :: rsd_LnLike , Likelihood

   .....
   .....
   !Reading the RSD parameters; these are my three extra parameters
    F1  = DataParams(1)
    F2  = DataParams(2)
    fvel= DataParams(3)

    ! **** Do some perturbation theory with these parameters ****

    rsd_LnLike=Likelihood
    END FUNCTION rsd_LnLike

For this reason I don’t need to change anything in paramdef.F90 or params_CMB.f90 in the source directory.
Do you think that I should still declare these parameters in a few more places?

Thanks,
Shadab

Antony Lewis
Posts: 1936
Joined: September 23 2004
Affiliation: University of Sussex

Post by Antony Lewis » February 26 2014

I think either your chains are never properly moving, or you didn't compile and run with MPI?

Post by shadab alam » February 27 2014

Dear Antony,

I am compiling it with the -DMPI flag. Here are the flags from my Makefile:

Code:

F90C = mpiifort
FFLAGS = -mkl -openmp -O3 -xHost -no-prec-div -fpp -DMPI

and I am running with MPI. Here is the command used to run it in my PBS script:

Code:

mpirun -machinefile machines -np $NCPU ./cosmomc test.ini > tcam_rsd0211.txt2

About the chains, I am not sure how to check whether they are moving properly or not. I have turned on feedback and printed my data likelihood with the MPI rank after each step. Most of them show a decrease, except some that show just one single step. Here are plots of the likelihood for a few such chains. Each plot shows chi-square against step number for one particular processor/chain (each chain runs on one processor).

[Images: chi-square versus step number for individual chains]

I was running 64 chains, each on one processor.

I understand that the *chk file will be created only after 100 steps. But does it check for 100 steps in each chain separately? I ask because some (around 5) of my chains produced only a single data likelihood output and never printed anything again in the output file.

It would be great if you could suggest the crucial things to check in order to know whether the chains ran properly.

Thanks,
Shadab

Post by Sheng Li » February 27 2014

It sounds like some of your chains stopped before they reached 100 effective samples.

Also, 64 chains per proc? How many cores does this processor have?
I would suggest using at most 8 chains per proc.
Then see if you have the same problem. But this will not guarantee that no chains stop, as you found for this 64-chain run.
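
One quick way to see which chains have stopped is to count the rows in each chain's txt file. The Python sketch below is only an illustration: it assumes the chain files sit in one directory and are named `<file_root>_<n>.txt`, and the helper names `chain_row_counts` and `short_chains` are made up here. The demo at the bottom writes fake files to a temporary directory, so it does not touch a real chains folder.

```python
import os
import re
import tempfile


def chain_row_counts(chain_dir, file_root):
    """Map chain index -> number of non-empty rows in <file_root>_<index>.txt."""
    name_re = re.compile(re.escape(file_root) + r"_(\d+)\.txt")
    counts = {}
    for name in os.listdir(chain_dir):
        match = name_re.fullmatch(name)
        if match is None:
            continue  # not a chain file for this root
        with open(os.path.join(chain_dir, name)) as handle:
            counts[int(match.group(1))] = sum(1 for line in handle if line.strip())
    return counts


def short_chains(counts, min_rows):
    """Indices of chains with fewer than min_rows rows, in ascending order."""
    return sorted(index for index, rows in counts.items() if rows < min_rows)


# Demo on fake chain files: three chains, the second stuck after a single row.
with tempfile.TemporaryDirectory() as demo_dir:
    for index, rows in [(1, 150), (2, 1), (3, 120)]:
        with open(os.path.join(demo_dir, f"test_{index}.txt"), "w") as handle:
            handle.write("1 10.5 0.7 0.4 0.4\n" * rows)
    demo_counts = chain_row_counts(demo_dir, "test")

print(short_chains(demo_counts, 100))  # [2]
```

A chain that hangs in its very first likelihood call may have an empty file or none at all, so an index missing from the result is as suspicious as a short one. Rows are recorded samples, not proposal steps, so the counts will be smaller than the step numbers printed with feedback on.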

Post by shadab alam » March 03 2014

Hello Sheng Li,

I tried checking a few things from your last reply.

Sheng Li wrote: It sounds like some of your chains stopped before they reached 100 effective samples.

Also, 64 chains per proc? How many cores does this processor have?
I would suggest using at most 8 chains per proc.
Then see if you have the same problem. But this will not guarantee that no chains stop, as you found for this 64-chain run.

I have run 16 chains on 16 processors, so 1 chain per processor.
It has completed a total of ~26000 steps. On average, each chain has completed 1600 steps, and all of them show a decrease in my data chi-square. Each chain has completed more than 100 steps. But I still don't see any *chk file.

Do you mind giving me some other hint or possible reason for this issue?

Thanks,
Shadab

Jason Dossett
Posts: 97
Joined: March 19 2010
Affiliation: The University of Texas at Dallas

Post by Jason Dossett » March 04 2014

shadab alam wrote: I understand that the *chk file will be created only after 100 steps. But does it check for 100 steps in each chain separately? I ask because some (around 5) of my chains produced only a single data likelihood output and never printed anything again in the output file.
Hi Shadab,

This is exactly the problem. No .chk files will be written until all the chains have burned in and your first MPI communication has happened. If a few chains are stopped, as you indicate above, then no .chk files will ever be written, regardless of how long the other chains have run. Most likely, some of your chains are getting caught in an infinite loop (or segfaulting without killing the job entirely). If it is an infinite loop from non-convergence of an integral or something similar, you should set up safeguards to reject a parameter set if the problematic loop/procedure runs too long (most likely the parameter combination being tried is in a really bad part of the parameter space).
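
As a rough illustration of that kind of safeguard, here is a small Python sketch of the pattern (the real fix would go in the Fortran likelihood). Everything in it is hypothetical: `converge`, `guarded_lnlike`, `toy_lnlike` and the `LOG_ZERO` value are invented for the example, and it assumes the sampler treats a very large -log(likelihood) as "reject this point".

```python
import math

LOG_ZERO = 1e30  # hypothetical "reject this point" value for -log(likelihood)


class NotConverged(Exception):
    """Raised when an iterative procedure gives up instead of looping forever."""


def converge(step, x0, tol=1e-8, max_iter=1000):
    """Iterate x -> step(x) until successive values differ by less than tol.

    The loop is capped at max_iter so a bad parameter set cannot hang the chain.
    """
    x = x0
    for _ in range(max_iter):
        x_new = step(x)
        if not math.isfinite(x_new):
            raise NotConverged("non-finite value")
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise NotConverged("max_iter reached")


def guarded_lnlike(lnlike, params):
    """Call lnlike(params); map non-convergence or non-finite results to LOG_ZERO."""
    try:
        value = lnlike(params)
    except NotConverged:
        return LOG_ZERO
    return value if math.isfinite(value) else LOG_ZERO


def toy_lnlike(params):
    """Toy -log(likelihood) whose inner iteration only converges for |r| < 1."""
    r = params["r"]
    x = converge(lambda value: r * value, x0=1.0)
    return 0.5 * (x ** 2 + (r - 0.5) ** 2)


print(guarded_lnlike(toy_lnlike, {"r": 0.5}) < 1e-10)       # True: converges, good fit
print(guarded_lnlike(toy_lnlike, {"r": -1.0}) == LOG_ZERO)  # True: rejected, no hang
```

The point of the design is that a bad parameter combination costs at most `max_iter` iterations and then gets rejected, so that chain keeps moving and can still take part in the MPI exchange that the checkpoint waits for.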

Good luck!

Post by Sheng Li » March 05 2014

Sorry for the late reply.

As I said before, you will ONLY get .chk files when each of your chain files (root = the name you chose for saving the params txt files) has 100 records.
Otherwise there will be NO .chk file in your directory. To be clear, once any of your chain files has 100 records (100 lines of params), you will see the corresponding .chk file for that chain file.

Whatever the steps you have described in these posts mean, CosmoMC only checks the number of accepted samples, which in turn depends on whether your likelihood function is appropriate for your task.

Therefore, there is a 'hack' or trick to get your .chk files generated: you can modify the threshold of 100 in

Code:

paramdef.F90
checkpoint_freq = 100 
....

to 10 or some other number, depending on how many lines you find in your chain files.
For example, if the shortest chain file has few lines (say 10 or fewer), you can change checkpoint_freq to a value between 2 and 10, to test whether your program actually runs properly.

*As for this threshold, I think there is no theoretical reason to choose 100, 1000 or 10; it is set for practical reasons, to save disk space and running time for computationally intensive programs, such as those using MPI, CUDA, etc.

If you still cannot see any .chk file, then your program or your modification is probably wrong.

Also, Antony already suspected that your chains were not moving properly. That is likely in your case: it would mean your chains never accepted 100 samples.

Besides, you may check the reply from Jason, just above.

Post by shadab alam » March 28 2014

Hello All,

Thanks Sheng Li and Jason for the replies.
I really appreciate your help.

I have been trying to figure out the reason for not getting the *chk file.
The sad part is that I still don't get the *chk files.
But I now know precisely which variable is the reason.

There is a variable called all_burn in paramdef.F90:

Code:

    logical, save :: all_burn = .false., done_check = .false., DoUpdates = .false.

This variable is part of the condition that decides whether *chk files should be created:

Code:

   if (checkpoint .and. all_burn .and. checkpoint_burn==0 .and. &
    (.not. done_check .or.  mod(sample_num+1, checkpoint_freq*Proposer%Oversample_fast)==0)) then
        done_check=.true.

I have run the code, playing with almost everything, with the following setting:

Code:

    integer, parameter :: checkpoint_freq = 5  !In paramdef.F90

Important input parameters:

Code:

burn_in = 0
checkpoint = T

I ran the script with the above settings for several days and looked at the number of lines in the *txt file for each chain. Most of the chains had around 100 lines in the *txt file. The shortest chain file had 10 lines and the longest had 152.

Every part of the condition for creating the checkpoint file was satisfied except all_burn.

The only place where all_burn is updated in paramdef.F90 is

Code:

        if (.not. all_burn) then
            call MPI_TESTALL(MPIChains-1,req, all_burn, stats, ierror)

I don't understand the need for the above code. I tried reading about the function MPI_TESTALL but it doesn't help much.
It would be great if someone could explain what is happening in the above lines of code and why it is required.

Thanks for all the help,
Shadab

Post by Jason Dossett » March 29 2014

Hi Shadab,

That variable is related to the point I made about at least one of your chains being stuck. In order for checkpoint files to be written all of your chains must complete their burn in. Checkpoint frequency does not matter until that happens. I suggest you check your likelihood code for any place where it can get stuck if a bad combination of parameters is introduced. It really looks like that is what is happening.
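
For intuition only, here is a toy Python model of that gating (it is not how CosmoMC is written). MPI_TESTALL sets its flag to true only when every request it is given has completed, so `all_burn` stays false while any other chain has not reported. The function `first_all_burn_step` and its arguments `stall_at` and `report_after` are invented for this sketch, and it simply assumes a chain reports once it has completed `report_after` steps.

```python
def first_all_burn_step(stall_at, report_after, n_steps):
    """Return the first step at which every chain has reported, or None.

    stall_at[i] is the step at which chain i hangs (None = never hangs);
    a chain completes steps 1 .. stall_at[i]-1 and nothing afterwards.
    A chain reports once it has completed report_after steps (assumption).
    """
    reported = [False] * len(stall_at)
    for step in range(1, n_steps + 1):
        for chain, stall in enumerate(stall_at):
            completes_step = stall is None or step < stall
            if completes_step and step >= report_after:
                reported[chain] = True
        # Same meaning as the MPI_TESTALL flag: true only if all have completed.
        if all(reported):
            return step
    return None


# Four healthy chains: everyone has reported by step 50.
print(first_all_burn_step([None, None, None, None], report_after=50, n_steps=1000))  # 50
# One chain hangs in its first step: the flag never becomes true.
print(first_all_burn_step([None, 1, None, None], report_after=50, n_steps=1000))     # None
```

In this picture, the healthy chains can run for as many steps as they like, and one hung chain is enough to keep the checkpoint from ever being written.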

I don't suggest changing the checkpoint frequency too much because, if you do, there will be a lot of IO too often and it will slow down your runs.

Best,
Jason

Post by shadab alam » April 01 2014

Hi Jason,
Jason Dossett wrote: That variable is related to the point I made about at least one of your chains being stuck. In order for checkpoint files to be written all of your chains must complete their burn in.

But I have set burn_in=0. Shouldn't this set all_burn = T?

Thanks,
Shadab
