A few days ago I posted about a very specific problem with normalizing flows. I have since managed to simplify the problem, and the remaining issue is this:
The aim is to train a probabilistic model p(x) such that I can sample from it and the samples maximise a function E(x). This does not work, and I don't understand why.
To figure out what the issue is, I reverse-engineered the problem: I optimised E(x) directly with plain SGD and found the parameters X that maximise E:
```python
X = tensor([[0.8828, 0.5280, 0.5363, 0.8985],
            [0.6880, 1.0427, 1.0345, 0.6723]], requires_grad=True)
```
In my case E(X) = 1, which is the maximum of E (the exact value does not really matter here).
Now I would like to train my probabilistic model such that X has maximum likelihood under p:
```python
flow_dist, transform = init_flow(input_dim=4 * 2)
optimizer = torch.optim.SGD(transform.parameters(), lr=learning_rate)

for i in range(1000):
    lp = -flow_dist.log_prob(X.reshape(2 * 4))  # negative log-likelihood of X
    print(lp.mean().item())
    optimizer.zero_grad()
    lp.mean().backward()
    optimizer.step()
```
`init_flow` is the function that sets up the Pyro autoregressive spline flow:
```python
def init_flow(input_dim=2, count_bins=8, hidden_dims=[10, 10]):
    base_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim),
                            validate_args=False)
    hidden_dims = input_dim * np.array(hidden_dims)
    param_dims = [count_bins, count_bins, count_bins - 1, count_bins]
    hypernet = AutoRegressiveNN(input_dim, hidden_dims, param_dims=param_dims)
    transform = T.SplineAutoregressive(input_dim, hypernet, count_bins=count_bins)
    pyro.module("my_transform", transform)
    flow_dist = dist.TransformedDistribution(base_dist, [transform],
                                             validate_args=True)
    return flow_dist, transform
```
This optimisation fails spectacularly. With a fairly large learning rate (lr = 0.1) I end up with log-likelihoods of around -300. With a smaller learning rate the training starts to converge, but at some point it becomes very unstable and `lp` jumps up. Does anyone know why this happens? Is there a fundamental issue with maximising likelihoods this way? If I flip the sign of `lp`, the optimisation converges to very low likelihoods. So it is somehow possible to make the parameters X *unlikely*, but not more likely.
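To show what I mean, here is a much simpler analogue of the same objective (my own minimal sketch with a plain Normal, not the flow code, and the numbers are just placeholders): maximising the log-likelihood of a single fixed point has no finite optimum, because the density at that point can grow without bound as the distribution collapses onto it.

```python
import torch

# Minimal analogue (my own sketch, not the flow code): maximise the
# log-likelihood of one fixed point under a Normal whose scale is learnable
# and whose mean is pinned at that point. The objective has no finite
# optimum -- SGD drives the scale towards zero and log_prob grows without
# bound, step after step.
x = torch.tensor([0.5])                          # one fixed "data" point
log_sigma = torch.zeros(1, requires_grad=True)   # learnable log-scale
optimizer = torch.optim.SGD([log_sigma], lr=0.1)

for i in range(100):
    d = torch.distributions.Normal(x, log_sigma.exp())
    loss = -d.log_prob(x).mean()   # same objective as my flow training
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(log_sigma.exp().item())   # scale keeps shrinking towards 0
print(d.log_prob(x).item())     # log-likelihood keeps growing
```

If this is what happens with the flow as well, the spline would be squeezing more and more density onto X until the gradients blow up, which would explain the jumps in `lp`.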
To check whether the model has trouble learning these specific parameters at all, I also trained it as follows as a sanity check:
```python
for i in range(1000):
    lp = (flow_dist.rsample() - X.reshape(2 * 4)) ** 2  # MSE between samples and X
    print(lp.mean().item())
    optimizer.zero_grad()
    lp.mean().backward()
    optimizer.step()
```
Here I train the samples of the flow model to match X with a simple mean-squared error, and amazingly this works very well.
Does anyone know what the problem is?