Resume the training

Hi - I am trying to run a time-consuming MCMC sampler. I am wondering if it is possible to resume the training after I save and reload the MCMC object.

To be specific, I need to run many MCMC iterations, and each step generally takes a long time. I use a GPU node, so the maximum walltime of each job is limited. I have to run for some steps and save the results, which have not yet converged. Then I want to reload the MCMC object and continue the warmup phase. Is that possible? Right now, each time I run it, it seems to start from scratch.

Are you using Pyro or NumPyro?

I am using Pyro.

First, you might consider switching to NumPyro, which is faster at MCMC. It is fast enough that you might not need checkpointing at all.

Second, I believe Pyro’s MCMC does not currently support checkpointing. You could put up a feature request to add checkpointing to Pyro’s StreamingMCMC, but given that most MCMC users prefer NumPyro over Pyro, it is probably better to spend developer effort adding checkpointing to NumPyro’s MCMC. @fehiepsi do you know if NumPyro’s MCMC already supports checkpointing?

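In NumPyro, we can use the MCMC object's post_warmup_state attribute as a checkpoint: assign the last state of a finished run to it, and the next call to run continues from that state instead of starting from scratch.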
