Error when sampling begins with multiple chains

fehiepsi · July 13, 2019, 12:39pm

@jboyml This is a long-standing PyTorch issue, "file_descriptor sharing strategy may be leaking FDs, resulting in DataLoader causing `RuntimeError: received 0 items of ancdata`" (Issue #973, pytorch/pytorch on GitHub), for which I couldn’t find a good solution. Could you try cleaning the folder /dev/shm before running the script?

find /dev/shm -name 'torch*' -delete

While running, you can use

watch "ls -1 /dev/shm | wc -l"

to see how many torch files are created in that folder. Using the new API and torch.multiprocessing.set_sharing_strategy('file_system') helps in my case, but I don’t think they resolve the root problem. From various PyTorch topics, it seems that using threads instead of processes might help, but I don’t have enough background to dive in that direction. By the way, could you make a fully reproducible script so we can dive into it again? I think it might also help if we (optionally) add some checkpoints to consume samples and release the shared resources of the subprocesses. In the meantime, I’ll try to find an easier-to-reproduce script in my old notebooks.
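
If it is easier to log the count from inside the script than to keep a watch window open, something like the following might work. This is only a rough sketch: count_shm_files is a hypothetical helper (not part of PyTorch or Pyro), and it assumes the shared-memory files have names starting with "torch", as the find pattern above suggests.

```python
import fnmatch
import os


def count_shm_files(directory="/dev/shm", pattern="torch*"):
    """Count entries in `directory` whose names match `pattern`.

    Hypothetical helper; the default pattern assumes the shared-memory
    files created by PyTorch have names beginning with "torch".
    """
    try:
        names = os.listdir(directory)
    except FileNotFoundError:
        # The directory does not exist (e.g. on a non-Linux system).
        return 0
    return sum(1 for name in names if fnmatch.fnmatch(name, pattern))


if __name__ == "__main__":
    print(count_shm_files())
```

Calling count_shm_files() before and after sampling should show whether the number of files keeps growing with the number of samples.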

Edit: I can reproduce the memory issue with the script in "MCMC does not work well with multi-chains in CPU due to Memory Error" (Issue #1730, pyro-ppl/pyro on GitHub) by setting a high num_samples. I’ll dive into this again to see if there are better solutions.