Building a Gaussian mixture with Relaxed Bernoulli/Categorical

Hi, I am trying to implement a 1-D Gaussian mixture of two components with a Relaxed Bernoulli (the binary case of Concrete/Gumbel-Softmax) as the variational posterior.
However, the model does not converge to the correct result:

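import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
from torch.distributions import constraints
from tqdm import tqdm

p = 0.6
n_sample = 1000
mask = dist.Bernoulli(probs=p).sample((n_sample,))
loc1, loc2 = -6.0, 3.0
scale = 0.5
data = dist.MaskedMixture(mask.bool(),
                         dist.Normal(loc1, scale),
                         dist.Normal(loc2, scale)).sample()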

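def model(data):
    # mixing probability of component 1, constrained to [0, 1]
    weights = pyro.param('weights', torch.tensor(0.5), constraint=constraints.unit_interval)
    locs = pyro.param('locs', torch.randn(2,))  # component means
    with pyro.plate('data', len(data)):
        # discrete assignment, used to index the component means
        assignment = pyro.sample('assignment', dist.Bernoulli(weights)).long()
        pyro.sample('obs', dist.Normal(locs[assignment], 1.0), obs=data)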

T = 0.5  # relaxation temperature
def guide(data):
    with pyro.plate('data', len(data)):
        # per-datapoint probability of assignment to component 1
        alpha = pyro.param('alpha', torch.rand(len(data)), constraint=constraints.unit_interval)
        pyro.sample('assignment', dist.RelaxedBernoulliStraightThrough(torch.tensor(T), probs=alpha))

def train(data, svi, num_iterations):
    losses = []
    for j in tqdm(range(num_iterations)):
        loss = svi.step(data)
        losses.append(loss)  # record the loss so the returned list is not empty
    return losses

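def initialize(seed, data, model, guide, optim):
    # reset state so each seed gives a fresh, reproducible initialization
    pyro.set_rng_seed(seed)
    pyro.clear_param_store()
    svi = SVI(model, guide, optim, Trace_ELBO(num_particles=50))
    return svi.loss(model, guide, data)

n_iter = 500
optim = Adam({'lr': 0.1, 'betas': [0.9, 0.99]})
# pick the seed whose initialization gives the lowest initial loss
loss, seed = min(
    [(initialize(seed, data, model, guide, optim), seed) for seed in range(100)])
# re-create the best initialization before training
pyro.set_rng_seed(seed)
pyro.clear_param_store()
svi = SVI(model, guide, optim, loss=Trace_ELBO(num_particles=50))
losses = train(data, svi, n_iter)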



The learned locs after training:

tensor([-0.9745, -0.4087], requires_grad=True)

Is there any way to debug this model or solve this issue? (I believe this is caused by a local-minimum problem, assuming my implementation is correct.)

have you tried a lower temperature T?
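For context on what the temperature does, here is a minimal pure-Python sketch of relaxed Bernoulli sampling via the logistic reparameterization. The helper names (`sample_relaxed_bernoulli`, `fraction_near_endpoints`) and the 0.05 threshold are made up for illustration and are not part of Pyro. It only shows that lower temperatures push samples toward 0 and 1, which is also where the sigmoid saturates.

```python
import math
import random


def sample_relaxed_bernoulli(prob, temperature, rng):
    """Draw one relaxed Bernoulli sample in [0, 1] via logistic reparameterization."""
    # keep u away from 0 and 1 so the logs stay finite
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    logistic_noise = math.log(u) - math.log1p(-u)
    logit = math.log(prob) - math.log1p(-prob)
    pre_sigmoid = (logit + logistic_noise) / temperature
    # numerically stable sigmoid
    if pre_sigmoid >= 0.0:
        return 1.0 / (1.0 + math.exp(-pre_sigmoid))
    e = math.exp(pre_sigmoid)
    return e / (1.0 + e)


def fraction_near_endpoints(prob, temperature, n=20000, eps=0.05, seed=0):
    """Fraction of samples within eps of 0 or 1 (hypothetical diagnostic)."""
    rng = random.Random(seed)
    samples = [sample_relaxed_bernoulli(prob, temperature, rng) for _ in range(n)]
    return sum(1 for x in samples if x < eps or x > 1.0 - eps) / n


for temperature in (2.0, 0.5, 0.05):
    print(temperature, round(fraction_near_endpoints(0.6, temperature), 3))
```

The printed fraction grows as the temperature drops, so a lower T makes the samples look more discrete at the cost of a more saturated (and typically noisier) gradient signal.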

Yes, I tried that, but it did not help. (To reduce the gradient variance caused by the low temperature, I also increased num_particles to 100.)

I also tried initializing the locs parameter to the ground-truth means [-6.0, 3.0]. However, they both shrink to somewhere around -0.5 after 1000 iterations.
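One way to check whether this collapse comes from the inference side rather than from the model itself is to compare the exact marginal log-likelihood (assignment summed out) at the ground-truth means and at the collapsed means. The sketch below is pure Python on synthetic data that only loosely mirrors the setup above; `mixture_loglik`, the 60/40 split, and the seed are assumptions for illustration.

```python
import math
import random


def normal_logpdf(x, loc, scale):
    z = (x - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)


def mixture_loglik(data, weight, locs, scale=1.0):
    """Exact log-likelihood of a two-component mixture with the assignment summed out.

    weight is P(assignment = 1); locs[assignment] is the component mean.
    """
    total = 0.0
    for x in data:
        a = math.log1p(-weight) + normal_logpdf(x, locs[0], scale)
        b = math.log(weight) + normal_logpdf(x, locs[1], scale)
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


# Synthetic data (assumed): 60% around 3.0, 40% around -6.0, scale 0.5.
rng = random.Random(0)
data = [rng.gauss(3.0, 0.5) if rng.random() < 0.6 else rng.gauss(-6.0, 0.5)
        for _ in range(1000)]

ll_truth = mixture_loglik(data, 0.6, (-6.0, 3.0))
ll_collapsed = mixture_loglik(data, 0.5, (-0.5, -0.5))
print(ll_truth, ll_collapsed)
```

If the ground-truth means score much higher than the collapsed ones, as they do here, the model's likelihood is not what pulls locs toward -0.5, which points at the guide, the ELBO estimator, or the optimization instead.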

P.S. Interestingly, I could not find any Gumbel-Softmax-based implementation of a mixture model on the Internet.

it may be that RelaxedBernoulliStraightThrough is buggy and/or numerically unstable. this distribution hasn’t seen much usage afaik. have you looked at the implementation?

The implementation looks good to me.

In addition, following the test case for the one-hot categorical, I ran the code below:

import torch
import pyro
from pyro import optim
from pyro.distributions import Bernoulli, RelaxedBernoulliStraightThrough
from pyro.infer import SVI, Trace_ELBO
from torch.distributions import constraints

def model():
    p = torch.tensor([0.8])
    pyro.sample('z', Bernoulli(probs=p))

def guide():
    q = pyro.param('q', torch.tensor([0.4]), constraint=constraints.unit_interval)
    temp = torch.tensor(0.05)
    pyro.sample('z', RelaxedBernoulliStraightThrough(temperature=temp, probs=q))

adam = optim.Adam({"lr": 0.1, "betas": (0.95, 0.999)})
svi = SVI(model, guide, adam, loss=Trace_ELBO(num_particles=100, vectorize_particles=True))

losses = []
for k in range(6000):
    loss = svi.step()
    losses.append(loss)

print(pyro.param('q'))
# Output: tensor([0.4520], grad_fn=<ClampBackward>)

Clearly, this “test case” failed.
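A possible explanation, offered as a hypothesis rather than a diagnosis: as far as I understand the Pyro documentation, RelaxedBernoulliStraightThrough returns quantized samples from rsample() but evaluates log_prob() on the relaxed (unquantized) sample. The model term is then a log probability mass while the guide term is a log density on (0, 1). The pure-Python sketch below mimics that setup with hypothetical helpers (`straight_through_draw` is not a Pyro function) to show how different the two quantities are at T = 0.05.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def straight_through_draw(prob, temperature, rng):
    """Return (hard sample, log density of the underlying soft sample on (0, 1))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log1p(-u)          # standard logistic noise
    logit = math.log(prob) - math.log1p(-prob)
    pre = (logit + noise) / temperature           # soft sample is sigmoid(pre)
    hard = 1.0 if pre > 0.0 else 0.0              # quantized value passed to the model
    # log density of the pre-sigmoid variable: log T + a - 2 softplus(a),
    # where a = logit - T * pre, which equals -noise
    log_q_pre = math.log(temperature) - noise - 2.0 * softplus(-noise)
    # change of variables to x = sigmoid(pre): subtract log(x (1 - x))
    log_q_soft = log_q_pre + softplus(pre) + softplus(-pre)
    return hard, log_q_soft


rng = random.Random(0)
draws = [straight_through_draw(0.4, 0.05, rng) for _ in range(20000)]
frac_ones = sum(h for h, _ in draws) / len(draws)
mean_log_q = sum(lq for _, lq in draws) / len(draws)
print(frac_ones, mean_log_q)
```

The hard samples behave like Bernoulli(q) draws, but the average guide log density is positive, which a log pmf can never be. The entropy-like term in the ELBO is therefore not comparable to a Bernoulli entropy, and the optimum of this objective need not sit at q = 0.8.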

def model(T):
    p = torch.tensor([0.8])
    temp = torch.tensor(T)  # temperature was undefined in the model before
    pyro.sample('z', RelaxedBernoulli(temp, probs=p))

def guide(T):
    q = pyro.param('q', torch.tensor([0.4]), constraint=constraints.unit_interval)
    temp = torch.tensor(T)
    pyro.sample('z', RelaxedBernoulli(temperature=temp, probs=q))

adam = optim.Adam({"lr": 0.001, "betas": (0.95, 0.999)})
svi = SVI(model, guide, adam, loss=Trace_ELBO(num_particles=100, vectorize_particles=True))

losses = []
T = 1.0
for k in range(6000):
    loss = svi.step(T)
    T = max(0.5, T * (0.999 ** k))

# Output: tensor([0.8006], grad_fn=<ClampBackward>)

A model with Relaxed Bernoulli (RB) as both prior and posterior works fine.

@xidulu ah yeah that’s interesting and perhaps makes sense. i don’t recall what the original references like this one do, i.e. whether they make the replacement only on the guide side or also on the model side (using pyro language). do you know?

Aha, that’s a tricky question:

Section C.2, “What you might relax and why” (page 15), of the Concrete paper discusses the different choices for the model/prior (relaxed or not). Their final choice is to use the relaxed Bernoulli/Categorical on both the model side and the guide side. (They also use a trick to obtain a stable evaluation of the KL term in the ELBO.)
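When both sides are relaxed, the KL term compares two densities on the same space, so a plain Monte Carlo estimate is well defined. A minimal pure-Python sketch follows, evaluating both densities on the pre-sigmoid (logistic) variable, which avoids computing logs of values near 0 or 1. The KL is unchanged by the sigmoid because it is a bijection. Whether this matches the paper's exact trick is an assumption on my part, and `kl_relaxed` is a hypothetical helper.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logit(p):
    return math.log(p) - math.log1p(-p)


def log_density_pre_sigmoid(l, prob, temperature):
    """Log density of the pre-sigmoid (logistic) variable of a relaxed Bernoulli."""
    a = logit(prob) - temperature * l
    return math.log(temperature) + a - 2.0 * softplus(a)


def kl_relaxed(q_prob, p_prob, temperature, n=20000, seed=0):
    """Monte Carlo estimate of KL(q || p), both relaxed Bernoulli at the same temperature."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # sample the pre-sigmoid variable from q
        l = (logit(q_prob) + math.log(u) - math.log1p(-u)) / temperature
        total += (log_density_pre_sigmoid(l, q_prob, temperature)
                  - log_density_pre_sigmoid(l, p_prob, temperature))
    return total / n


kl_same = kl_relaxed(0.8, 0.8, 0.5)
kl_diff = kl_relaxed(0.4, 0.8, 0.5)
print(kl_same, kl_diff)
```

The estimate is zero when q equals p and positive otherwise, so minimizing it pulls q toward the prior's 0.8, consistent with the relaxed-on-both-sides run above.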

In the Gumbel-Softmax paper, they use an unrelaxed prior and a relaxed posterior.

This does not seem to be a big deal when training a VAE with a discrete latent space: the network will converge anyway (and a VAE does not have a ground truth to recover). However, when it comes to non-amortized SVI in Pyro, it seems that the choice should be made carefully.