Preliminary feedback / questions

I’ve been experimenting with Pyro for only a few days, but I wanted to share some personal considerations before going on holiday.

– Why is it necessary to manually specify independence with pyro.iarange() and friends?

This could be accomplished automatically, either via a tree-like structure or by superclassing Variable and propagating the dependencies, so that pyro.sample() statements have the information required to establish the set of parents of the random variable in question.
I’m probably missing something here; are there performance issues with this kind of approach?
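To make the question concrete, this is the kind of manual declaration I mean (a minimal sketch of the context-manager form; pyro.plate is the newer name for pyro.iarange):

import torch
import pyro
import pyro.distributions as dist

def model(data):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))
    # conditional independence across data points has to be declared by hand:
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(mu, 1.0), obs=data)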

– It’s possible to write code that doesn’t raise errors or warnings but results in incorrect inference.

For instance, at some point I refactored the construction of a specific kind of guide distribution into a function.
The code in question contained calls to pyro.param(), and I ended up calling this function and storing the returned guide distribution as a data member at initialization time (of the object exposing model() and guide()).
This didn’t produce any error or warning, but the ELBO didn’t get optimized correctly.
This kind of issue is possible in PyTorch too (with respect to correct backpropagation), but it seems to be more of a problem in Pyro, probably linked to the use of strings for parameter identification (I have been following the GitHub issue on this topic, and I’m also not 100% convinced that the current approach is robust).
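Roughly, the pattern looked like this (a paraphrased sketch, not my actual code; make_guide_dist and Wrapper are stand-ins for the refactored helper and the containing object):

import torch
import pyro
import pyro.distributions as dist

def make_guide_dist():
    # the refactored helper containing the pyro.param() calls
    z_loc = pyro.param("z_loc", torch.tensor(0.0))
    return dist.Normal(z_loc, 1.0)

class Wrapper:
    def __init__(self):
        # pyro.param() runs once here, at initialization time, outside guide()...
        self.cached_guide_dist = make_guide_dist()

    def guide(self, data):
        # ...and the cached distribution is reused on every step, so no
        # pyro.param() call ever appears inside guide() itself: no error,
        # no warning, but the ELBO is not optimized correctly.
        pyro.sample("z", self.cached_guide_dist)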

– More specifically, which parts of the model and the guide are required to be created inside the model() and guide() functions passed to SVI(), and which parts can be safely cached outside?

That’s not immediately obvious. For instance, the examples in the tutorials usually create PyTorch Variables (with requires_grad set to True) at every call of guide(), while one might assume they need to be created only once, which is what you would do when optimizing a loss in PyTorch.

– I would welcome integration with PyTorch’s DataLoader, which is the standard mechanism for batching in PyTorch.

I couldn’t find information on this in the tutorials, and it’s not clear to me whether it’s supported.
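For reference, this is the kind of pattern I have in mind (a speculative sketch with made-up model/guide; I haven’t verified that this is the intended way to combine the two, and the subsampling scale factor is ignored for simplicity):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(batch):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))
    with pyro.plate("data", batch.shape[0]):
        pyro.sample("obs", dist.Normal(mu, 1.0), obs=batch)

def guide(batch):
    mu_loc = pyro.param("mu_loc", torch.tensor(0.0))
    pyro.sample("mu", dist.Normal(mu_loc, 1.0))

loader = DataLoader(TensorDataset(torch.randn(1000)), batch_size=64, shuffle=True)
svi = SVI(model, guide, Adam({"lr": 1e-3}), loss=Trace_ELBO())
for epoch in range(10):
    for (batch,) in loader:
        svi.step(batch)  # each mini-batch is forwarded to model() and guide()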

– I would welcome “debug” info about the inference algorithm used when calling SVI.step().

It’s nice to have a framework that automatically takes care of various parts of inference, but it’s not always evident what is happening under the hood.
For example, for Gamma variables in the guide it’s still possible to rely on a reparametrization trick (via an approximate inverse CDF or the generalized reparametrization trick), but the user doesn’t know whether this is the case or not.
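Even exposing what the distribution itself reports would already help; for example (assuming the has_rsample flag from torch.distributions carries over to pyro.distributions and that the choice of estimator actually depends on it):

import pyro.distributions as dist

print(dist.Normal(0.0, 1.0).has_rsample)  # True: pathwise (reparameterized) gradients available
print(dist.Gamma(2.0, 2.0).has_rsample)   # True in recent PyTorch versions
print(dist.Bernoulli(0.5).has_rsample)    # False: a score-function estimator is needed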

Thank you for open-sourcing your work, I’m looking forward to future developments.

Hi @steplu, feel free to file most of these as feature request issues!

Why is it necessary to manually specify independence…? … superclassing Variable

You’re right, and we’re looking into superclassing Variable. One issue is that the requirement to use pyro.Variables might end up being more intrusive than pyro.irange (e.g. how do you deal with an nn.Module?). If you have ideas for additional clean abstractions, we’re open to them. We’re also trying to open up pyro.contrib to highly speculative features.

Why is it necessary to manually specify independence…? … superclassing Variable…

since we support arbitrary python code in models, superclassing Variable or the like will not be sufficient in the general case, for example if there is control flow in python. we intend to build out ways of tracking dependency that make more use of the pytorch graph but, again, any such construct won’t be fully general.

thanks for the feedback. we should aim to make this clearer in the documentation/tutorials.

the variable can be created wherever. however, the pyro.param statement needs to be inside the model/guide in which it’s being used. the param call says something like “hey, this is a parameter named foo”: the first time foo is seen, the given variable is registered with pyro; on subsequent calls to pyro.param(“foo”, …) the second argument is ignored and the call simply grabs foo from the ParamStore. since arbitrarily different sets of parameters can occur in different calls to model() or guide(), it is necessary that the parameters are specified this way within each invocation of model/guide.
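to make this concrete, here’s a minimal sketch of the intended pattern (the names are arbitrary):

import torch
import pyro
import pyro.distributions as dist

def guide(data):
    # first call: "z_loc" / "z_scale" are registered in the ParamStore with these
    # initial values. subsequent calls: the initial values are ignored and the
    # current (optimized) tensors are fetched, so it's fine to write this on
    # every invocation of guide().
    z_loc = pyro.param("z_loc", torch.tensor(0.0))
    z_scale = pyro.param("z_scale", torch.tensor(1.0))
    pyro.sample("z", dist.Normal(z_loc, z_scale))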

You’re right, and we’re looking into superclassing Variable. One issue is that the requirement to use pyro.Variables might end up being more intrusive than pyro.irange (e.g. how do you deal with an nn.Module?). If you have ideas for additional clean abstractions, we’re open to them. We’re also trying to open up pyro.contrib to highly speculative features.

You’re right that it’s best to keep intrusiveness to a minimum; the alternative I thought of was to use the AD graph to propagate dependencies instead (I’m not sure which modifications to PyTorch internals would be required).

since we support arbitrary python code in models, superclassing Variable or the like will not be sufficient in the general case, for example if there is control flow in python. we intend to build out ways of tracking dependency that make more use of the pytorch graph but, again, any such construct won’t be fully general.

I’m not sure I’m following here; consider the following (pseudo-code) example:

x_mu = 0.0
if condition_1:
    z = pyro.sample('z', dist.Normal(0.0, 1.0))
    if condition_2:
        x_mu = z
x = pyro.sample('x', dist.Normal(x_mu, 1.0))

Here x depends on z, but if condition_1 is false for that specific trace, Pyro is not aware that z exists, even with the current approach of conservative dependency discovery.

If instead condition_1 is true and condition_2 is false, then the current approach would take the dependency into account, while a graph-based approach would not.

Is this what you meant?

yes. i’m referring to dependency “within a trace”

Still, if condition_1 is false, neither approach would discover the dependency.
Are there reasons to distinguish between the two cases listed above?

well, for a general program, discovering all the dependencies in the most general sense might require running the program more times than the age of the universe allows. but knowing about the kind of dependency i’m talking about still allows us to, e.g., build gradient estimators with reduced variance.