Test results change when using validation in DKL for classification

Dear Pyro users,

I am trying to use DKL in Pyro, following the tutorial here. The task is binary classification.
For the setup:

import numpy as np
import torch
from torch.optim.lr_scheduler import StepLR
import pyro.contrib.gp as gp
import pyro.infer as infer

# CNN feature extractor warped into the RBF kernel (deep kernel)
cnn = classifier()
rbf = gp.kernels.RBF(input_dim=num_features, lengthscale=torch.ones(num_features))
deep_kernel = gp.kernels.Warping(rbf, iwarping_fn=cnn)

# inducing points selected from the training data
Xu = torch.from_numpy(retrieve_inducing_points(X_train, y_train, 128).reshape(-1, 1, ndimension, ndimension).astype(np.float32))

likelihood = gp.likelihoods.Binary()
latent_shape = torch.Size([])
gpmodule = gp.models.VariationalSparseGP(X=Xu, y=None, kernel=deep_kernel, Xu=Xu,
                                         likelihood=likelihood, latent_shape=latent_shape,
                                         num_data=X_train.shape[0], whiten=True, jitter=2e-4)
gpmodule.cuda()

optimizer = torch.optim.Adam(gpmodule.parameters(), lr=0.001)
scheduler = StepLR(optimizer, step_size=9, gamma=0.5)
elbo = infer.TraceMeanField_ELBO()
loss_fn = elbo.differentiable_loss

retrieve_inducing_points is simply a function I defined to select the inducing points from the training data.
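Something along these lines would do the job (a simplified sketch; the actual selection logic may differ):

def retrieve_inducing_points(X, y, num_points, seed=0):
    # simplified stand-in: a reproducible random subset of the training inputs;
    # a k-means-based selection would also work from the GP's point of view
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=num_points, replace=False)
    return X[idx]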

The training looks like:

import pyro
from tqdm import tqdm

acc = []
pyro.clear_param_store()
for epoch in tqdm(range(1, epochs + 1)):
    train(train_loader, gpmodule, optimizer, loss_fn, epoch)
    with torch.no_grad():
        acc.append(valid(valid_loader, gpmodule))
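
train is essentially the minibatch training step from the Pyro DKL example; roughly something like this (the exact body may differ):

def train(train_loader, gpmodule, optimizer, loss_fn, epoch):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        # condition the sparse GP on the current minibatch
        gpmodule.set_data(data, target)
        optimizer.zero_grad()
        loss = loss_fn(gpmodule.model, gpmodule.guide)
        loss.backward()
        optimizer.step()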

My issue is that the final test results change depending on whether I run the validation step inside the training loop, which is strange to me, since validation should not affect the final test (to my understanding). Does anyone have an idea why this is the case?

Many thanks in advance.

Haga

@Haga Does it make a big difference? If there are random statements in your validation code, the results will vary.

@fehiepsi

The validation is practically the same as the testing:
def valid(test_loader, gpmodule):
    correct = 0
    gpmodule.eval()
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        # get prediction of GP model on new data
        f_loc, f_var = gpmodule(data)
        pred = gpmodule.likelihood(f_loc, f_var)
        correct += pred.eq(target).long().cpu().sum().item()
    return 100. * correct / len(test_loader.dataset)

except that in the test I also need the probabilities:

def test(test_loader, gpmodule):
    prediction = []
    proba = []
    gpmodule.eval()
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        f_loc, f_var = gpmodule(data)
        # h is a sample of the latent function values (before the sigmoid response)
        h = dist.Normal(f_loc, f_var.sqrt())()
        # sample hard predictions from the likelihood
        pred = gpmodule.likelihood(f_loc, f_var)
        prediction.append(pred.cpu().numpy())
        proba.append(h.cpu().numpy())
    return np.array(prediction), np.array(proba)

When including validation I get:

accuracy achieved: 0.9090909090909091
F1 score achieved: 0.9195402298850575
recall achieved: 0.9302325581395349
precision achieved: 0.9090909090909091
specificity: 0.8823529411764706
average precision: 0.9409203758495305
roc auc: 0.9500683994528043

without validation I get:

accuracy achieved: 0.9090909090909091
F1 score achieved: 0.9176470588235294
recall achieved: 0.9069767441860465
precision achieved: 0.9285714285714286
specificity: 0.9117647058823529
average precision: 0.9468715886721786
roc auc: 0.9480164158686731

Thanks for the reply

It seems to me that the results are similar. It is likely that random statements in the validation code affect the result. For example, if you call

def f():
    random()
    return random()

you will get a different result from

def f():
    return random()

One source of randomness in your validation code is gpmodule.likelihood(f_loc, f_var): it draws samples internally, so every call advances the global RNG state and changes the samples drawn later in your test.
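If you want the test metrics not to depend on whether validation ran, two options (a sketch, assuming the default sigmoid response function of gp.likelihoods.Binary) are to re-seed the RNG right before testing, or to use a deterministic prediction instead of sampling:

import pyro
import torch

# option 1: fix the RNG state immediately before the final test, so any
# sampling done earlier (e.g. during validation) no longer matters
pyro.set_rng_seed(0)
prediction, proba = test(test_loader, gpmodule)

# option 2: skip sampling and threshold the sigmoid of the posterior mean
# of the latent function (data is one batch from test_loader)
f_loc, f_var = gpmodule(data)
probs = torch.sigmoid(f_loc)      # assumes Binary's default sigmoid response
pred = (probs > 0.5).long()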

Thanks for your reply.

Indeed you are right, the source of randomness in the validation accounts for the difference.
I am more interested in specificity and recall, which are more affected. Given that the ROC AUC and average precision are not that sensitive here, I guess I can have a look at the decision threshold, which is currently the default 0.5.
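For example, something along these lines (a sketch; it assumes proba holds probabilities in [0, 1], e.g. the sigmoid of the latent values, and 0.4 is just an illustrative cutoff):

import numpy as np

def predict_with_threshold(proba, threshold=0.5):
    # hard labels from predicted probabilities; lowering the threshold
    # trades specificity for recall (and vice versa)
    return (np.asarray(proba) >= threshold).astype(int)

preds = predict_with_threshold(proba, threshold=0.4)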

Many thanks @fehiepsi , I appreciate it.
