Using normalizing flows for RL-like optimisation is very unstable

Hi everyone,

I am sorry, I did not realise that there is no LaTeX support; I hope the notation is still readable.

I have been struggling with the following problem for a few weeks now. I want to find an x that optimises a function E(x) by sampling x from a probabilistic model p_W(x) parametrised by W (I am using Pyro's neural spline flows). x is a multidimensional real vector, so my cost function takes the form:

\langle E \rangle = \sum_{x \sim p_W(x)} E(x)

And the gradients can be obtained with the log-derivative trick (the score-function / REINFORCE estimator) and read:

\partial \langle E \rangle / \partial W_i = \sum_{x \sim p_W(x)} \partial_{W_i} \log p_W(x) \, E(x)
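In code, this estimator looks roughly like the following minimal sketch (flow_dist, batch and E are placeholder names; E is treated as a black box returning a Python float):

# Minimal sketch of the score-function (REINFORCE) estimator.
# flow_dist: any distribution exposing .sample() and .log_prob();
# E: black-box objective, evaluated without gradients.
samples = flow_dist.sample(torch.Size([batch])).detach()
values = torch.tensor([E(x) for x in samples])            # no gradient through E
surrogate = (flow_dist.log_prob(samples) * values).mean()
surrogate.backward()   # gradient of <E> w.r.t. the flow parameters W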

The problem is that the optimisation is very unstable. It often starts well and the loss goes towards zero, but then the gradients suddenly explode, the loss spikes, and the optimisation ends up in a spurious minimum.

I have tried many things: reducing the variance of the gradient by subtracting the baseline \langle \log p_W(x) \rangle \langle E(x) \rangle, clipping gradients, adding L2 regularisation, and I even tried using an approximation of the Jacobian to regularise the optimisation, inspired by this paper.
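For reference, the standard baseline version of the estimator, continuing the sketch above, uses the batch mean of E as the baseline b; this keeps the gradient unbiased while lowering its variance:

# Baseline subtraction: E_p[grad log p(x)] = 0 for any constant b,
# so subtracting b from the weights does not bias the gradient.
baseline = values.mean()
surrogate = (flow_dist.log_prob(samples) * (values - baseline)).mean()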

Does anyone know what the issue is here, and how I can improve training? E(x) is a smooth function with no discontinuities. The normalizing flow works just fine if I train it on a dataset as in the Pyro examples; the issue really comes from combining the flow p(x) with the function E(x).

hello, can you please provide more details? i’m afraid this description is a bit vague, which makes it hard to offer useful suggestions. what is E(x)? what is the dimension of x? what is the dimension of W? etc.

Sure, I am sorry if it was too vague.

I use an autoregressive NN for the neural spline flow, similar to the example from Pyro:

import numpy as np
import torch
import pyro
import pyro.distributions as dist
import pyro.distributions.transforms as T
from pyro.nn import AutoRegressiveNN

def init_flow(hidden_dims=[10, 10], input_dim=10):
    count_bins = 8
    base_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim), validate_args=False)
    # scale the hidden layer widths with the input dimension
    hidden_dims = list(input_dim * np.array(hidden_dims))
    param_dims = [count_bins, count_bins, count_bins - 1, count_bins]
    hypernet = AutoRegressiveNN(input_dim, hidden_dims, param_dims=param_dims)
    transform = T.SplineAutoregressive(input_dim, hypernet, count_bins=count_bins, bound=10)
    pyro.module("my_transform", transform)
    flow_dist = dist.TransformedDistribution(base_dist, [transform], validate_args=True)

    return flow_dist, transform
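A quick usage sketch of what init_flow returns (shapes in the comments):

# Draw samples from the flow and score them.
flow_dist, transform = init_flow(input_dim=10)
x = flow_dist.sample(torch.Size([5]))    # shape (5, 10)
lp = flow_dist.log_prob(x)               # shape (5,)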

And the function E(x) is a function with many local minima. The idea is that with the probabilistic model we sample many different regions of the parameter space, to eventually converge to a good local minimum or maximum.

With the following functions one can reproduce the problem:

def random_fct(params):
    # Build a random test function: a sum of signed products of sin/cos factors.
    # Only len(params) is used; the values of params are ignored here.
    fcts = [torch.sin, torch.cos]
    sums = []
    signs = []
    for j in range(len(params)):
        prods = []
        for i in range(len(params)):
            f = np.random.choice(fcts)
            prods.append(f)
        signs.append((-1) ** np.random.randint(0, 2))
        sums.append(prods)
    return sums, signs

def test_fct(params, signs, sums):
    # Evaluate the random function: S = sum_j sign_j * prod_i f_ji(params_i)^2
    S = 0
    for j in range(len(params)):
        prods = 1
        for i in range(len(params)):
            prods *= sums[j][i](params[i]) ** 2
        S += prods * signs[j]
    return S
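A quick sanity check of the two helpers (the dimension here is arbitrary):

# Build a random 10-dimensional test function and evaluate it once.
dim = 10
sums, signs = random_fct(torch.zeros(dim))    # only the length of the argument matters
print(test_fct(torch.rand(dim), signs, sums))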

The optimisation is done with the following loop:

from tqdm import tqdm
import matplotlib.pyplot as plt

learning_rate = 0.2
epochs = 100
batch = 50

n_qubits, n_layers = 5, 2            # problem size; in my application x has n_qubits*n_layers entries
input_dim = n_qubits * n_layers

flow_dist, transform = init_flow(input_dim=input_dim)

optimizer = ClippedSGD(transform.parameters(), lr=learning_rate, clip_norm=0.1)

# fix one random test function of the right dimension
sums, signs = random_fct(torch.zeros(input_dim))

train_progress = []
lp_progress = []
exp_progress = []

norm_dist = dist.Normal(torch.zeros(input_dim), torch.ones(input_dim), validate_args=False)

for epoch in tqdm(range(epochs)):
    # sample from the flow by pushing base noise through the transform
    norm_samples = norm_dist.sample(torch.Size([batch, 1])).detach()
    params = transform(norm_samples).detach()

    # evaluate the black-box objective on each sample (no gradients through E)
    expects = torch.stack([test_fct(params[k][0], signs, sums) for k in range(len(params))])
    lp = flow_dist.log_prob(params)

    # score-function surrogate: log-likelihood weighted by the detached E(x)
    log_expect = lp.reshape(expects.shape) * expects.detach()

    loss = -(log_expect.mean())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    flow_dist.clear_cache()          # clear the transform cache between steps, as in the Pyro examples

    train_progress.append(loss.item())
    lp_progress.append(lp.mean().item())
    exp_progress.append(expects.mean().item())

plt.plot(train_progress)
plt.show()

plt.plot(lp_progress)
plt.show()

plt.plot(exp_progress)
plt.show()

The clipped SGD is defined as:

from typing import Any, Callable, Optional
from torch.optim import Optimizer

class ClippedSGD(Optimizer):
    def __init__(self, params, lr: float = 1e-3,
                 eps: float = 1e-8, clip_norm: float = 1.0):
        defaults = dict(lr=lr, eps=eps, clip_norm=clip_norm)
        super().__init__(params, defaults)

    def step(self, closure: Optional[Callable] = None) -> Optional[Any]:
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # clamp every gradient component to [-clip_norm, clip_norm]
                # (elementwise, not a clip of the global gradient norm)
                grad = p.grad.data
                grad.clamp_(-group['clip_norm'], group['clip_norm'])
                # plain SGD update with the clamped gradient
                p.data.add_(grad, alpha=-group['lr'])

        return loss

With standard SGD the gradients immediately explode and the optimisation fails.
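Note that clamp_ above clips each gradient component independently rather than the global norm; a comparable run with true norm clipping could use the stock optimizer together with the built-in utility, e.g. this sketch:

# Sketch: global-norm gradient clipping with the stock SGD optimizer.
optimizer = torch.optim.SGD(transform.parameters(), lr=learning_rate)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(transform.parameters(), max_norm=0.1)
optimizer.step()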

why are you transforming the data you’re trying to fit?

I am not sure if I understand the question. Do you mean why I am not directly optimising over the function E(x) or the parameters x, but instead take the detour of sampling x from a distribution?

The reason for this is that we want to study optimisations where we don’t have access to the gradient of E(x), because computing it is much costlier than taking the gradient of p(x).

if i’m reading your code correctly you appear to be fitting the normalizing flow to itself.

normal mode of operation:

  • data x is fixed
  • fit parameters of flow to model p_empirical(x)

your apparent mode of operation:

  • sample normal noise
  • push through transform so you get samples from p_flow
  • fit p_flow to said samples
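in code, the normal mode is just maximum likelihood on a fixed dataset, roughly like this (data_x and num_steps are placeholders):

# normal mode: fit the flow to fixed data by maximizing log-likelihood
for step in range(num_steps):
    loss = -flow_dist.log_prob(data_x).mean()   # data_x never changes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    flow_dist.clear_cache()                     # as in the pyro tutorial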

The idea is to maximise the expectation value of E(x), which reads \langle E(x) \rangle = \sum_x p(x) E(x).

The gradient then is \sum_x \nabla \log p(x) E(x), evaluated on samples x \sim p(x).

So basically I expect the algorithm to increase the probability (log-likelihood) of samples that have a high value of E(x) and decrease the probability of samples with a low value of E(x).

Maybe I am doing something conceptually wrong, but I am pretty sure this is the way to do it. Any thoughts?

what is x? where does it come from? in a typical application of a normalizing flow you have a bunch of vectors x_i, say in three-dimensional space, and they follow some distribution, e.g. an oblong blob centered at (1,2,3). the flow is then fit to that data, with the result that you have a continuous probability distribution that approximates the empirical distribution over the x_i: p_flow \approx p_empirical

Yes, this I understand. But now I don’t have any data x_i, only a function E that depends on x (just some real vector), and I want to find the x that maximises E. Instead of maximising E(x) directly through SGD or similar, I want to optimise the sampling distribution p(x) such that I sample the parameters of E that maximise E.

I think a similar example would be reinforcement learning, where E would be the reward signal and p would be the policy. But it could be that normalizing flows are not really used for such purposes and we need to find a different model.

in that case you should directly be sampling x’s from the domain of E(x) (e.g. uniformly from the unit cube in D dimensions) and trying to maximize something like Expectation_p(x) [ E(x) ], in which case you directly invoke log_prob in your objective function. or, alternatively, you can maximize the same thing by sampling from the flow but having an objective that only depends on E(x)
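e.g. a sketch of the first variant, with a uniform proposal over a hypothetical box [-b, b]^D (the estimator is correct up to the constant volume factor; E returns floats here):

# sketch: estimate Expectation_p(x)[E(x)] with uniform proposal samples;
# p(x) enters the objective through log_prob, E(x) is a constant weight
x = (torch.rand(batch, input_dim) * 2 - 1) * b       # uniform on [-b, b]^D
weights = torch.tensor([E(xi) for xi in x])          # black-box, no gradient
loss = -(flow_dist.log_prob(x).exp() * weights).mean()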

note doing something like this with a normalizing flow can potentially be a bit dangerous, because with a sufficiently large neural network your p(x) would start looking like a delta function at the true maximum (assuming it’s unimodal), which presumably isn’t what you want. and the only thing that will prevent that from happening (apart from the difficulty of optimization) is your choice of neural network capacity (which is hard to reason about). of course you can also add an entropy term or the like to discourage collapse
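one concrete way to add such a term is to fold an entropy bonus into the reward of the score-function estimator (beta is a coefficient you’d have to tune; samples/values come from the sampling step of the training loop):

# sketch: entropy-regularized objective max E_p[E(x)] + beta * H[p],
# with H[p] = E_p[-log p(x)]; folding -beta * log p into the reward
# keeps the score-function gradient unbiased since E_p[grad log p] = 0
beta = 0.1
lp = flow_dist.log_prob(samples)       # samples drawn from the flow, detached
reward = values - beta * lp.detach()
loss = -(lp * reward).mean()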

Ok, thanks a lot for your help. I will give this a try.

About the delta functions: this is actually something we would like to see, to understand the function E(x) better. There shouldn’t be any numerical problems with delta functions and normalizing flows, should there?

can you specify your question in more detail? an actual delta function doesn’t possess a finite pdf, consequently can’t be meaningfully differentiated by e.g. pytorch. all gradient based learning methods rely on some amount of smoothness and so things like delta functions generally cause problems

Ok, yes, that makes sense. I was thinking more of an approximation of a delta function that still has some finite variance and a finite pdf. My question was more in the sense of: can a normalizing flow learn just a single data point? I assume that should be possible.

yes in principle. but in practice you might run into all kinds of optimization/numerical issues