Coefficients exploring solution space within constraints

I’ve got a model that uses an SVI engine. The dataset has millions of rows and a very complex design. We know some of the variables will ultimately have coefficients very close to 0. When we only constrain the coefficient distributions to be positive, we get the expected behavior: the coefficients start very large at the first iteration and then shrink consistently over time until they settle at a value very close to 0, never really fluctuating. However, if we put a very small upper bound on the distribution, we see the coefficient mus (the variational location parameters) fluctuate throughout the constrained range rather than generally moving in one direction. No matter which way the coefficients are moving, the loss continues to decrease throughout training. (A minimal sketch of the two setups appears after the questions below.) I have a few questions about what could be happening:

  1. Is there anything in numpyro’s algorithm that encourages exploration of the entire viable solution space? In other words, when we allow the coefficients to start with large values, the model seems generally satisfied to let them shrink continually and then settle, but in the highly constrained version is it weighted in such a way as to explore all possible solutions?
  2. Is there any danger in allowing some coefficients to be unconstrained and forcing others to be tightly constrained?
  3. Changing the step size in the optimizer or the batch size has not made a difference. Are there other parameters that might help the model navigate such small constraints in a more systematic way?
  4. The loss tends to go negative in both versions of the model (though not at a point in training that correlates with any noticeable trend, such as swings in coefficient values). What does this indicate, and is it an issue?
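
For concreteness, here is a minimal sketch of the two constraint setups I’m describing. This is not our actual model; the priors, variable names, and data here are simplified stand-ins:

```python
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal

def model(x, y=None):
    # positive-only version: this coefficient shrinks smoothly toward ~0
    beta_pos = numpyro.sample("beta_pos", dist.HalfNormal(1.0))
    # tightly bounded version: this one's variational mu fluctuates
    # across the constrained range [0, 0.01] during training
    beta_bounded = numpyro.sample("beta_bounded", dist.Uniform(0.0, 0.01))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mean = beta_pos * x[:, 0] + beta_bounded * x[:, 1]
    numpyro.sample("y", dist.Normal(mean, sigma), obs=y)

# toy data standing in for the real millions-of-rows dataset
x = random.normal(random.PRNGKey(0), (1000, 2))
y = 0.001 * x[:, 1] + 0.1 * random.normal(random.PRNGKey(1), (1000,))

guide = AutoNormal(model)
svi = SVI(model, guide, numpyro.optim.Adam(1e-3), Trace_ELBO())
result = svi.run(random.PRNGKey(2), 5000, x, y)
```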

Thanks in advance for your help. numpyro is a great tool, and I am always looking to improve my understanding of how best to utilize it.

hello,

afaik questions 1-3 are more or less impossible to answer because i) we know next to nothing about your model and ii) in any case the dynamics of complex high-dimensional non-convex stochastic optimization are difficult to understand and reason about, especially in general terms.

regarding 4, the elbo is a lower bound on the log marginal likelihood. if your observed data are continuous, the latter quantity is the log of a density and not an actual probability; as such it can take either sign, and the sign of the elbo has no particular meaning whatsoever. note also that the loss reported by SVI is the negative elbo, so a negative loss simply corresponds to a positive elbo.
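
to make this concrete, here’s a toy example (nothing to do with your model) where the loss goes negative simply because the log likelihood is positive:

```python
from jax import random
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal

def model(y):
    mu = numpyro.sample("mu", dist.Normal(0.0, 1.0))
    # tiny observation noise: the density at each data point exceeds 1,
    # so each log-density term is positive (log N(0 | 0, 0.01) ~= 3.7)
    numpyro.sample("y", dist.Normal(mu, 0.01), obs=y)

y = jnp.zeros(100)  # toy data tightly clustered at 0
guide = AutoNormal(model)
svi = SVI(model, guide, numpyro.optim.Adam(1e-2), Trace_ELBO())
result = svi.run(random.PRNGKey(0), 2000, y)

# svi minimizes the negative elbo; here the elbo is large and positive,
# so the reported loss is negative, which is perfectly fine
print(result.losses[-1])
```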