A few days ago I posted about a very specific problem with normalizing flows. I have since managed to simplify the problem, and I am now stuck on the following issue:

The aim is to train a probabilistic model p(x) such that I can sample from it and the samples maximise a function E(x). For some reason this does not work.

To figure out what the issue is, I reverse engineered the problem: I optimised E(x) with plain SGD and found the parameters X that maximise E.

```
X = tensor([[0.8828, 0.5280, 0.5363, 0.8985],
            [0.6880, 1.0427, 1.0345, 0.6723]], requires_grad=True)
```

In my case E(X) = 1, which is the maximum of E. (The exact function does not really matter here.)
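
For reference, the reverse-engineering step looked roughly like the following; `E` below is only a stand-in for my actual objective (which I am not including here), so take this as a sketch rather than the exact code I ran:

```
import torch

def E(x):
    # placeholder objective; all that matters is that it is differentiable
    # and bounded above by 1
    return torch.exp(-((x - 1.0) ** 2).mean())

X = torch.randn(2, 4, requires_grad=True)
optimizer = torch.optim.SGD([X], lr=0.01)
for i in range(1000):
    loss = -E(X)  # maximise E by minimising -E
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```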

Now I would like to train my probabilistic model such that the likelihood p(X) of these parameters is maximised.

```
flow_dist, transform = init_flow(input_dim=4*2)
optimizer = torch.optim.SGD(transform.parameters(), lr=learning_rate)
for i in range(1000):
    # negative log-likelihood of the flattened parameters X under the flow
    lp = -flow_dist.log_prob(X.reshape(2*4))
    print(lp.mean().item())
    optimizer.zero_grad()
    lp.mean().backward()
    optimizer.step()
```

`init_flow` is the function that defines the Pyro normalizing spline flow:

```
import numpy as np
import torch
import pyro
import pyro.distributions as dist
import pyro.distributions.transforms as T
from pyro.nn import AutoRegressiveNN

def init_flow(input_dim=2, count_bins=8, hidden_dims=[10, 10]):
    # standard normal base distribution over the flattened parameters
    base_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim), validate_args=False)
    hidden_dims = input_dim * np.array(hidden_dims)
    param_dims = [count_bins, count_bins, count_bins - 1, count_bins]
    hypernet = AutoRegressiveNN(input_dim, hidden_dims, param_dims=param_dims)
    transform = T.SplineAutoregressive(input_dim, hypernet, count_bins=count_bins)
    pyro.module("my_transform", transform)
    flow_dist = dist.TransformedDistribution(base_dist, [transform], validate_args=True)
    return flow_dist, transform
```
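
Just to make the shapes explicit: with `input_dim = 4*2` the flow is over flattened 8-dimensional vectors, so a quick sanity check of the untrained flow would look like this (not part of the training loop, only a check):

```
flow_dist, transform = init_flow(input_dim=4*2)

samples = flow_dist.sample(torch.Size([5]))  # shape: [5, 8]
print(samples.shape)

# log-density of the target parameters X under the untrained flow
print(flow_dist.log_prob(X.reshape(2*4)).item())
```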

This optimisation fails spectacularly, and how depends on the learning rate: if it is quite big (lr = 0.1) I end up with log-likelihoods of around -300; if I choose it smaller, the training starts to converge, but at some point it becomes very unstable and the value of `lp` jumps up. Does anyone know why this happens? Is there any issue with maximising likelihoods? If I change the sign of `lp` (see the snippet below), the optimisation converges to very low likelihoods. So it is somehow possible to make the parameters `X` unlikely, but not more likely.
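
By "change the sign" I mean replacing the loss line in the loop above with its positive counterpart, i.e. something like:

```
# minimise the log-likelihood instead of maximising it
lp = flow_dist.log_prob(X.reshape(2*4))
```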

To check whether the model has problems learning these specific parameters, I even trained it as follows, as a double check:

```
for i in range(1000):
    # draw 100 samples from the flow and push them towards X with a simple MSE loss
    samples = flow_dist.rsample(torch.Size([100]))
    lp = (samples - X.reshape(2*4)) ** 2
    print(lp.mean().item())
    optimizer.zero_grad()
    lp.mean().backward()
    optimizer.step()
```

Here I train the samples of the flow model to match `X` with a simple mean squared error, and amazingly this works very well.

Does anyone know what the problem is?