VAE example: different loss in different runs

Hi everybody,

I’m experimenting with Pyro’s VAE example. Everything is copied and pasted straight into my notebook. The main training loop is as follows:

from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO

# setup the VAE
vae = VAE(use_cuda=use_cuda)

# setup the optimizer
adam_args = {"lr": learning_rate}
optimizer = Adam(adam_args)

# setup the inference algorithm
svi = SVI(vae.model, vae.guide, optimizer, loss=Trace_ELBO())

train_elbo = []
# training loop
for epoch in range(num_epochs):
    # initialize loss accumulator
    epoch_loss = 0.
    # do a training epoch over each mini-batch x returned
    # by the data loader
    for _, (x, _) in enumerate(train_loader):
        # if on GPU put mini-batch into CUDA memory
        if use_cuda:
            x = x.cuda()
        # do ELBO gradient and accumulate loss
        batch_loss = svi.step(x)
        epoch_loss += batch_loss

    # report training diagnostics
    normalizer_train = len(train_loader.dataset)
    total_epoch_loss_train = epoch_loss / normalizer_train
    train_elbo.append(total_epoch_loss_train)
    print("[epoch %03d]  average training loss: %.4f" % (epoch, total_epoch_loss_train))

Everything works fine the first time I run this part of the code. The loss log looks like this:

[epoch 000] average training loss: 190.9459
[epoch 001] average training loss: 146.3057
[epoch 002] average training loss: 132.5690
[epoch 003] average training loss: 124.1392
[epoch 004] average training loss: 119.2743
[epoch 005] average training loss: 116.0978
[epoch 006] average training loss: 113.8199
[epoch 007] average training loss: 112.2251
[epoch 008] average training loss: 110.9388
[epoch 009] average training loss: 109.9328
[epoch 010] average training loss: 109.1249

But if I run the same code a second time, the loss jumps up and never improves:

[epoch 000] average training loss: 585.9169
[epoch 001] average training loss: 585.8450
[epoch 002] average training loss: 585.9478
[epoch 003] average training loss: 585.8123
[epoch 004] average training loss: 585.9065
[epoch 005] average training loss: 585.8792
[epoch 006] average training loss: 585.8365
[epoch 007] average training loss: 585.9029
[epoch 008] average training loss: 585.8711
[epoch 009] average training loss: 585.7444
[epoch 010] average training loss: 585.9552

Something must be saved internally in Pyro that changes the result in the second run. Is anybody else having the same problem?

Thank you in advance!

I had a similar problem. I’m not exactly sure what causes it, but I found that restarting my kernel after each full training run fixed it.

Be aware that the Pyro param store is global state.

You may need to invoke pyro.clear_param_store() if you’re doing things inside a REPL.

See the docs for more information: http://docs.pyro.ai/en/0.2.1-release/parameters.html
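
For anyone following along, here is a minimal sketch of where that call would go, reusing the setup from the snippet above (VAE, use_cuda and learning_rate are the names from the original post; pyro.clear_param_store() is the actual Pyro API):

import pyro
from pyro.optim import Adam
from pyro.infer import SVI, Trace_ELBO

# wipe any parameters left over from a previous run in the same kernel,
# otherwise the old values in the global param store are reused
pyro.clear_param_store()

vae = VAE(use_cuda=use_cuda)
optimizer = Adam({"lr": learning_rate})
svi = SVI(vae.model, vae.guide, optimizer, loss=Trace_ELBO())
# ... then run the training loop exactly as before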

Thanks, it works! One question: What do you mean by REPL?

@ttc REPL = read-eval-print loop (an interactive session such as the Python prompt or a Jupyter notebook).