Dependency tracking in Pyro

I have a question about dependency tracking in Pyro. I think my confusion stems from the technical definition of dependence and how batch dimensions work. I am trying to reconcile this with what I’ve read in the “Aside: Dependency tracking in Pyro” section of SVI Part III. One point of confusion for me is that when the tensor shape tutorial first talks about batch_shapes, it says

Indices over .batch_shape denote independent random variables, whereas indices over .event_shape denote dependent random variables (ie one draw from a distribution).

But then later on, it seems to be saying that even when something has been declared to be a dimension in batch_shape (as opposed to event_shape), if you want the variables in the dimension to be independent, you have to further annotate it with plate (on dev, 0.3).

In trying to understand, I’ve formed a working hypothesis. Even though it is likely incorrect, maybe spelling out my current understanding will help pinpoint where I’ve gone wrong: variables in a single batch dimension (-1) have the potential to be independent (as in, they are not draws from a multivariate distribution with a defined covariance matrix), but you still need a further annotation to tell Pyro they are in fact independent. A Pyro program shares batch dimensions (-1, -2) across the whole program (across different sample statements). Variables assigned to different dimensions (i.e. -1 and -2) are considered independent, but within a single dimension (e.g. -2), the variables there (up to the size of the dimension) are dependent unless further annotated. The prior sentence is true even if the variables come from separate sample statements.

I’ve created a few situations with corresponding questions that I think would help clarify:

a = Bernoulli(0.5).sample([2])
  • Are a[0] and a[1] independent?
    • Also, is this the same as doing the following?
a1 = Bernoulli(0.5).sample()
a2 = Bernoulli(0.5).sample()
  • If the above are not independent, is there a probabilistic explanation? Is it that a[1] can depend on a[0], but not the other way around? Is it that there could be a common ancestor (as in a Bayes network) that we have not specified?
     with pyro.plate("my_plate", 2):
           a = Bernoulli(0.5).sample()
  • Are a[0] and a[1] independent here?
      with pyro.plate("my_plate1", 1, dim=-1):
             a = Bernoulli(0.5).sample()

      with pyro.plate("my_plate2", 1, dim=-2):
             b = Bernoulli(0.5).sample()
  • Are a and b independent from one another here?
    with pyro.plate("my_plate1", 2):
        a = Bernoulli(0.5).sample()
        with pyro.plate("my_plate2", 2):
            b = Bernoulli(0.5).sample()
  • Are a[0] and a[1] independent from one another here?
  • What about a[0] and b[0]?

The point you may be getting stuck on is the distinction between the underlying true conditional independence relationships in a probabilistic model and the subset of independence relationships that inference algorithms are aware of and can exploit. Tracking dependence in general probabilistic programs where some random variables may affect the existence of others is a difficult and subtle problem, so Pyro is deliberately conservative in its assumptions about independence in order to avoid introducing inference errors.

The answer to each of your examples is the same: the independence relationships you’re asking about do in fact hold for the models you’ve provided, but from Pyro’s point of view, only the independence relationships explicitly declared by pyro.plate or implied by temporal ordering (i.e. in your final example, b[:, 0] depends on a[0] and not vice versa) are assumed to hold.

This is obviously suboptimal for deriving efficient inference algorithms, so a lot of our ongoing work in Pyro, especially on enumeration, involves identifying and exploiting conditional independence more aggressively.

Thank you. So I think I am starting to understand a bit better. I have a few (possibly repetitive) follow-up questions below, but I think if I could confirm the answers to them it would really help solidify my understanding.

With respect to understanding how plate works - adapting the last scenario from my prior message:

  with pyro.plate("my_plate1", 2):
        a = sample('a', Bernoulli(0.5))
        b = sample('b', Bernoulli(0.5))
        with pyro.plate("my_plate2", 2):
            c = sample('c', Bernoulli(0.5))
  • From Pyro’s perspective:

    • Can a[0] and b[0] be treated as independent during inference? My guess is no.
    • Can a[0] and a[1] be treated as independent during inference? My guess is yes.
    • Can a[0] and c[0] be treated as independent during inference? My guess is yes.
  • If I want to declare to Pyro that 2 Bernoulli random variables can be treated as independent during inference, are the following 3 scenarios equivalent:

for i in pyro.plate("my_plate", 2):
      sample("b_{}".format(i), Bernoulli(0.5))
with pyro.plate("my_plate", 2):
      sample("b", Bernoulli(tensor([0.5, 0.5])))
with pyro.plate("my_plate1"):
      sample("b_1", Bernoulli(0.5))

with pyro.plate("my_plate2"):
      sample("b_2", Bernoulli(0.5))
  • From an underlying implementation perspective, is there any truth to what I was saying about Pyro using the batch dimensions across sample statements as the way it tracks what it can treat as independent?

  • Is it true that the call to .independent() is a bit of a different construct than plate? It is more about reshaping a particular distribution (which could involve declaring some samples as independent), but it doesn’t handle independence across sample statements?

Thanks so much for your help!

Here are three rules you can apply to simplify reasoning about independence annotations in Pyro:

  1. Every random variable is assumed to depend on all previously sampled random variables unless Pyro is informed otherwise.
  2. The joint distributions of all random variables in each slice of a plate context (i.e. a loop iteration or a slice along the plate dimension) are assumed to be conditionally independent given all previous random variables in all enclosing plate slices.
  3. Plates used as context managers and not given a value for plate(..., dim=...) allocate new batch dimensions on the left when they are entered.

It might also be helpful to do a bit of background reading on plate notation in graphical models, from which the semantics of pyro.plate is derived.

Can a[0] and b[0] be treated as independent during inference? My guess is no.

No, a[0] and b[0] are in the same slice (0) and by the first rule above b[0] is assumed to depend on a[0].

Can a[0] and a[1] be treated as independent during inference? My guess is yes.

Yes, by the second rule above a[0] and a[1] are in different slices and are therefore independent.

Can a[0] and c[0] be treated as independent during inference? My guess is yes.

No, because by the third rule the leftmost dimension of c corresponds to my_plate2, so c[0] is a slice along my_plate2 that still shares slice 0 of my_plate1 with a[0]. However, applying the second and third rules, a[1] and c[:, 0] can be treated as independent, as can a[0] and c[:, 1].

If I want to declare to Pyro that 2 Bernoulli random variables can be treated as independent during inference, are the following 3 scenarios equivalent:

I’m not sure what you mean by equivalent, but these will all behave differently. The first version has two sample statements that are marked as independent of one another, the second is a vectorized version of the first that has a single sample statement that is marked as independent along the leftmost dimension, and the third has two sample statements where the second depends on the first just as if there were no plates used, because they have no batch_shapes and the plates have no sizes.

From an underlying implementation perspective, is there any truth to what I was saying about Pyro using the batch dimensions across sample statements as the way it tracks what it can treat as independent?

Sort of - you can think of a vectorized plate context as associating a batch dimension with all sample statements that appear within it.

Is it true that the call to .independent() is a bit of a different construct than plate

.independent() is a somewhat unfortunate name, originally drawn from TensorFlow Distributions, that we’re looking to change. What it actually does is declare dimensions dependent, i.e. move dimensions from the batch_shape of a distribution to its event_shape. See the tensor shape tutorial for more details.
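Under the hood this is (roughly) torch.distributions.Independent, so a torch-only sketch shows the shape bookkeeping:

```python
import torch
from torch.distributions import Normal, Independent

d = Normal(0., 1.).expand([10])
print(d.batch_shape, d.event_shape)    # torch.Size([10]) torch.Size([])

# Wrapping in Independent moves the rightmost batch dim into event_shape,
# i.e. declares it "dependent": one joint draw instead of 10 scalar ones.
di = Independent(d, 1)
print(di.batch_shape, di.event_shape)  # torch.Size([]) torch.Size([10])

# log_prob now sums over the event dimension: one density per joint draw
x = di.sample()
print(di.log_prob(x).shape)            # torch.Size([])
```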

Again, thank you for the comprehensive responses. As I think we (working through this with a co-worker) close in on understanding this, I have another question. If we remove all plates completely from the situation:

sample("b", Bernoulli(tensor([0.5, 0.5])))

Is b[1] independent of b[0] here (from Pyro’s perspective during inference)?


To provide some context for this question: the section “It is always safe to assume dependence” of the tensor shape tutorial contrasts these two examples:

10 dependent samples

pyro.sample("x", dist.Normal(0, 1).expand([10]).independent(1))

10 independent samples

with pyro.iarange("x_iarange", 10):
    pyro.sample("x", dist.Normal(0, 1))

But given this language (also from the tensor shape tutorial):

Indices over .batch_shape denote independent random variables, whereas indices over .event_shape denote dependent random variables (ie one draw from a distribution).

It would seem that the first example without the call to independent(1) (therefore leaving the rightmost dimension as a batch dimension, which I believe is the default here) would have created 10 samples that Pyro considers independent during inference.

pyro.sample("x", dist.Normal(0, 1).expand([10]))

In other words, is plate necessary there? If you don’t mark a single sample with independent(1), is independence assumed within the vectorized sample draw?

Is b[1] independent of b[0] here (from Pyro’s perspective during inference)?

No. By the first rule from my previous post, Pyro assumes dependence unless you explicitly inform it otherwise by using a plate. Distribution batch_shapes only pass information about independence to Pyro through plate.

Got it- thanks for your patience in answering all those questions.

@eb8680_2 I’ve been playing around a bit more after watching the YouTube video about implementing ELBO and tracing the guide to replay for the model. Would it be possible to explain a bit more how the knowledge derived from plate is actually used in inference? Is it something that affects the log probability of the model when some of the data is observed? Is there a spot in the code you could point me to? By the way, the new mini-pyro example is a great addition to the docs; it is very encouraging to see that kind of effort spent to assist newcomers.

If you’re asking about how inference algorithms exploit conditional independence in an abstract sense, have a look at the SVI tutorials and references there for a discussion of variance reduction for stochastic gradient estimators, and a probabilistic machine learning textbook like Bishop or Murphy for an introduction to variable elimination and message-passing. See the (draft) enumeration tutorial for more on how plates are used in Pyro for efficiently enumerating over batched discrete variables as part of message-passing.

Is that what you’re looking for? If not, I can probably be more helpful if you have a more specific question.

Thank you, this is helpful and gives me plenty to go over. I am (very) slowly working through the PGM course on Coursera, and I am hopeful some of these topics will be covered there.

I have a quick comment/question about the tensor shape tutorial. In the section titled It is always safe to assume dependence, the text contrasts how Pyro treats the code with .to_event() versus the code within a plate context (what dependence assumptions Pyro can make, etc), but every time I read this, I always wonder what the behavior is if you don’t do either of those. In other words, if you were to just do

x = pyro.sample("x", dist.Normal(0, 1).expand([10]))

or

x = pyro.sample("x", dist.Normal(torch.ones(10) * 0, 1))

or just

 x1 = pyro.sample("x1", dist.Normal(0, 1))
 ..
 x10 = pyro.sample("x10", dist.Normal(0, 1))

(all outside of a plate context).

I think elaborating on this example might be helpful to newcomers trying to reconcile how these concepts work. For example, I am still a bit confused about why calling .to_event() would be different from doing nothing, given the following text in the “Aside: Dependency tracking in Pyro” section of SVI Part III:

“If random variable z2 follows z1 in a given stochastic function then z2 may be dependent on z1 and therefore is assumed to be dependent.”

Sorry, I’m not sure I understand what you mean by “behavior.” Are you asking about independence? By the first rule from earlier in this thread, things are only declared independent to Pyro if you annotate them with plate, but the following comment from the “It is always safe to assume dependence” section of the tensor shape tutorial still holds in the situation you seem to be asking about:

In practice Pyro’s SVI inference algorithm uses reparameterized gradient estimators for Normal distributions so both gradient estimators have the same performance.

In general, I would encourage you to experiment with some simple end-to-end examples - you’ll probably find that you’re overthinking things, and if you have specific questions about complete examples (e.g. “how do I improve convergence/reduce gradient variance in this model” or “why am I getting this shape error”) I can be much more helpful.

Thanks- so by behavior, I meant independence assumptions that Pyro can make when estimating gradients. Specifically, my confusion stems from the following text

x = pyro.sample("x", dist.Normal(0, 1).expand([10]).to_event(1))
assert x.shape == (10,)

This is useful for two reasons: First it allows us to easily swap in a MultivariateNormal distribution later. Second it simplifies the code a bit since we don’t need a plate (see below) as in

with pyro.plate("x_plate", 10):
      x = pyro.sample("x", dist.Normal(0, 1))  # .expand([10]) is automatic
      assert x.shape == (10,)

The difference between these two versions is that the second version with plate informs Pyro that it can make use of conditional independence information when estimating gradients, whereas in the first version Pyro must assume they are dependent (even though the normals are in fact conditionally independent). 

This explains that with the .to_event() version Pyro must assume the samples are dependent, and that with the plate version it can treat them as conditionally independent, but it does not say what Pyro does if you do neither, that is, if you just wrote:

 x = pyro.sample("x", dist.Normal(0, 1).expand([10]))

I suspect I am overthinking things, because in the examples I have tried, it does not really seem to affect the outcome, but I would just like to have a better grasp on this. In general I struggle with the intuition behind when you would want to use .to_event(1) at all, and I was hoping clarifying this portion of the documentation would make something click for me.

that’s a good question. in that case pyro treats the sample as dependent on variables sampled upstream. even though there is a batch dim, pyro does not make use of it, since it is not in a plate. sometimes this is what you want, e.g. the scale in this example model, which relies on broadcasting.

Thank you, a couple of follow up questions:

  1. In this example, with no .to_event(), is it true that x[1] will be treated as dependent on x[0], and that the entire x tensor would be treated as dependent on any previously sampled random variables (not in their own plate)? In other words, is it pretty much exactly the same as if you had called .to_event(1)?

  2. As a concrete example of my .to_event(1) confusion: in this solution that @fritzo generously posted to a problem I was working through, I do not understand why the .to_event(1) calls are recommended, unless it is just allowing the guess_probabilities to depend on one another (in the problem as defined, they are independent).

  3. I wanted to confirm a couple cases:

    c = pyro.sample('c', Bernoulli(0.5))
    with pyro.plate("my_plate1", 1):
          a = pyro.sample('a',  Bernoulli(0.5))
    
    with pyro.plate("my_plate2", 1):
          b = pyro.sample('b', Bernoulli(0.5))
    

Is b treated as independent of a by Pyro? What about a being treated as independent of c? My current understanding from the rules above is that the answer to both of these questions would be no.

Is there an internal data structure I can consult to answer these types of questions myself if I run an example through Pyro?