Multivariate regression not converging

Dear Pyro Experts,

I am new to Pyro and probabilistic programming in general. I am trying to fit a multivariate regression with 40 covariates and 82 response variables. My simple model is as follows:

    def model(x, y=None):
        weight = pyro.sample("weight", dist.Normal(torch.zeros(40, 82), torch.ones(40, 82)).to_event(2))
        bias = pyro.sample("bias", dist.Normal(torch.zeros(82), torch.ones(82)))
        mu = torch.matmul(x, weight) + bias
        with pyro.plate("data", x.shape[0]):
            obs = pyro.sample("obs", dist.MultivariateNormal(loc=mu, scale_tril=torch.eye(82)), obs=y)

    guide = AutoMultivariateNormal(model)

    adam = pyro.optim.Adam({"lr": 0.0001})
    svi = SVI(model, guide, adam, loss=Trace_ELBO())
    for j in range(40000):
        loss = svi.step(guideMatrix_trainTensor, expressionMatrix_trainTensor)

However, my model does not converge; the loss fluctuates back and forth. When I do a simple regression with the same matrices, one response variable at a time, I get results with no problem. May I ask what may be going wrong here? Also, is there a way to define priors over the multivariate covariance matrix?

Thanks in advance.

Hi @Basak,

I’d recommend @martinjankowiak’s recent tips tutorial. Additionally here are some specific tips in order of priority:

  1. First try an AutoDelta, then an AutoNormal, then an AutoMultivariateNormal guide. These progress from cheap-and-robust to accurate-but-expensive-and-numerically-sensitive.
  2. Try decreasing the initial scale parameter of your autoguide:
    guide = AutoMultivariateNormal(model, init_scale=0.01)  # or even 0.001
  3. Make sure all your data inputs and outputs have been scaled to have mean on the order of zero and standard deviation on the order of 1. E.g. 10 ± 0.1 is ok, but 1e6 ± 1e7 will probably fail. This is needed because the stochastic optimizers used in deep learning (and in SVI) work best when parameters and data are roughly on unit scale.
  4. Since your likelihood is diagonal, it will be cheaper to use a Normal.
  5. I’d recommend also learning the observation noise parameter, say obs_scale.
  6. I’d recommend a higher learning rate (say 5e-3) and fewer steps (say 1000); if that doesn’t work on a shallow model like this, then something else is probably wrong.

Here are all my suggestions combined:

    # Standardize data if needed (skip if your tensors are already z-scored).
    x = guideMatrix_trainTensor
    y = expressionMatrix_trainTensor
    x_std = (x - x.mean(0)) / x.std(0)
    y_std = (y - y.mean(0)) / y.std(0)

    def model(x, y=None):
        weight = pyro.sample("weight", dist.Normal(torch.zeros(40, 82), 1).to_event(2))
        bias = pyro.sample("bias", dist.Normal(torch.zeros(82), 1).to_event(1))
        obs_scale = pyro.sample("obs_scale", dist.LogNormal(0, 1))
        mu = x @ weight + bias
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Normal(mu, obs_scale).to_event(1), obs=y)

    # Try a few different guide types.
    guide1 = AutoDelta(model)
    guide2 = AutoNormal(model, init_scale=0.01)
    guide3 = AutoMultivariateNormal(model, init_scale=0.01)
    guide = guide1  # Change to guide2 and guide3 once you get this working.

    adam = pyro.optim.Adam({"lr": 0.005})
    svi = SVI(model, guide, adam, loss=Trace_ELBO())
    for step in range(1000):  # Increase as needed.
        loss = svi.step(x_std, y_std)
        if step % 100 == 0:
            print(f"step {step} loss = {loss}")

Thank you so much @fritzo, I appreciate your help very much. My inputs and outputs are already given as z-score values, but I will try your other suggestions. One last question: which implemented distributions can be used as priors for covariance matrices? I saw an example usage of LKJ in one of the previous posts:

    def model(x, y=None):
        β = pyro.sample("β", dist.Normal(torch.zeros(x.shape[1], 2), torch.ones(x.shape[1], 2)).to_event(2))
        B0 = pyro.sample("B0", dist.Normal(torch.zeros(2), torch.ones(2)))
        μ = torch.matmul(x, β) + B0
        θ = pyro.sample("θ", dist.HalfNormal(torch.ones(2)))
        η = pyro.sample("η", dist.HalfNormal(1.))
        L = pyro.sample("L", dist.LKJCorrCholesky(2, η))
        L_Ω = torch.mm(torch.diag(θ.sqrt()), L)
        with pyro.plate("data", x.shape[0]):
            obs = pyro.sample("obs", dist.MultivariateNormal(loc=μ, scale_tril=L_Ω), obs=y)
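As a quick sanity check on the `L_Ω` construction in that example (the numbers below are made up for a 2-dimensional case), scaling the correlation Cholesky factor by the per-output standard deviations yields a Cholesky factor of the full covariance matrix:

```python
import torch

# Hypothetical values for a 2-dimensional case.
theta = torch.tensor([2.0, 0.5])        # per-output variances (θ)
C = torch.tensor([[1.0, 0.0],
                  [0.6, 0.8]])          # a valid correlation Cholesky factor (rows have unit norm)

# Scale the rows of the correlation Cholesky by the standard deviations.
scale_tril = torch.mm(torch.diag(theta.sqrt()), C)

# The implied covariance: the diagonal recovers θ, and the
# off-diagonal entries encode the correlations scaled by the stds.
cov = scale_tril @ scale_tril.T
print(cov)
```

Here `cov.diag()` recovers `theta` exactly, and the implied correlation between the two outputs is the 0.6 encoded in `C`.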

Is there another way to define a prior for the covariance matrix? I am asking because my outputs actually have a covariance structure and I would like the model to use that information. How can I incorporate that into the model?

Hi @Basak,

Hmm, I’m not sure I understand. If you have a known covariance matrix cov for the conditional distribution y|x, then you could use it directly in the likelihood:

    pyro.sample("obs", dist.MultivariateNormal(mu, cov), obs=y)

Note that it is cheaper to precompute the Cholesky factor once, outside of inference:

    scale_tril = torch.linalg.cholesky(cov)

    def model(...):
        pyro.sample("obs", dist.MultivariateNormal(mu, scale_tril=scale_tril), obs=y)
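To make the precomputation concrete, here is a minimal sketch in plain PyTorch (the 3×3 `cov` below is hypothetical). Factoring once up front loses nothing, since the Cholesky factor reproduces the covariance exactly, while passing `scale_tril` lets MultivariateNormal skip the O(d³) decomposition at every SVI step:

```python
import torch

# Hypothetical known covariance for 3 response variables
# (symmetric and positive definite).
cov = torch.tensor([[1.0, 0.5, 0.2],
                    [0.5, 2.0, 0.3],
                    [0.2, 0.3, 1.5]])

# Factor once, outside the model / training loop.
scale_tril = torch.linalg.cholesky(cov)
print(scale_tril)
```

`scale_tril` is lower triangular, and `scale_tril @ scale_tril.T` recovers `cov` up to floating-point error.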