"it is always safe to assume dependence" - to_event vs plate

Hello ! I’m a statistician new to Pyro. I want to understand the section of the documentation titled “it is always safe to assume dependence”.

In the model we could have 3 dependency scenarios for observations x:
1x) x = pyro.sample("x", Normal(0, 1).expand([N]))
2x) x = pyro.sample("x", Normal(0, 1).expand([N]).to_event(1))
3x) with pyro.plate("x_plate", N): x = pyro.sample("x", Normal(0, 1))

and similarly for latents z:
1z) z = pyro.sample("z", Normal(0, 1).expand([J]))
2z) z = pyro.sample("z", Normal(0, 1).expand([J]).to_event(1))
3z) with pyro.plate("z_plate", J): z = pyro.sample("z", Normal(0, 1))
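
(For concreteness, here is a quick shape check I ran; it assumes import torch and import pyro.distributions as dist, with Normal = dist.Normal:)

N = 10
d1 = Normal(0., 1.).expand([N])              # batch_shape (N,), event_shape ()
d2 = Normal(0., 1.).expand([N]).to_event(1)  # batch_shape (),  event_shape (N,)
x = torch.zeros(N)
print(d1.log_prob(x).shape)  # torch.Size([10]): one log-density per component
print(d2.log_prob(x).shape)  # torch.Size([]):   a single joint log-density
# 3x/3z inside pyro.plate also draw a length-N tensor, but Pyro additionally
# records the plate dimension as (conditionally) independent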

I want to understand the consequences of each. Below are my questions and guesses, but I’d appreciate someone rewriting the correct consequences for me.

1x) likelihood assumes independence across x ? the SVI algorithm cannot exploit this independence for mini-batching or fast gradient computations ?
2x) same ?
3x) likelihood assumes independence across x, and SVI can exploit it for mini-batching and gradient computation ?

1z) prior assumes independence across z ? autoguide is AutoMultivariateNormal ? or mean field AutoDiagonalNormal ?
2z) prior assumes independence across z ? autoguide is AutoMultivariateNormal ?
3z) prior assumes independence across z and autoguide is mean field AutoDiagonalNormal ?
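
(My working assumption, sketched below, is that the autoguide family is whatever class we hand the model to, e.g.:)

from pyro.infer.autoguide import AutoDiagonalNormal, AutoMultivariateNormal
guide_mf  = AutoDiagonalNormal(model)      # mean-field: independent normals per latent
guide_mvn = AutoMultivariateNormal(model)  # full-rank: jointly correlated normals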

Thank you !!

if you’re using SVI with reparameterizable latent variables like gaussians, the gradient “follows” the dependency structure and so the only reason you’d “need” to use plate is if you’re going to do mini-batching.

if you’re using SVI with discrete latent variables, the gradient estimators are more complicated, and you can get variance reduction by exploiting dependency structure. so in that scenario you’d also want to use explicit plate structure where applicable.
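
e.g. a minimal sketch of the mini-batching case (hypothetical 1-d data; subsample_size is the knob):

import torch
import pyro
import pyro.distributions as dist

data = torch.randn(1000)

def model(data):
    loc = pyro.sample("loc", dist.Normal(0., 1.))
    # only 100 random points are scored per step; pyro rescales their
    # log-likelihood by 1000/100 so the elbo estimate stays unbiased
    with pyro.plate("data", len(data), subsample_size=100) as idx:
        pyro.sample("obs", dist.Normal(loc, 1.), obs=data[idx])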

Thank you, @martinjankowiak !

I can’t yet map your answer to my 6 questions 1z, 2z, 3z, 1x, 2x, 3x.

if you’re using SVI with reparameterizable latent variables like gaussians…

This sounds like you’re talking about latent variables z, so my 1z, 2z, 3z above ?

so the only reason you’d “need” to use plate is if you’re going to do mini-batching.

But then this sounds like it’s about observations x, so my 1x, 2x, 3x above.

if you’re using SVI with discrete latent variables, the gradient estimators are more complicated, and you can get variance reduction…

which variance ? variance in estimation of gradients for SVI ? posterior variance ?

https://pyro.ai/examples/svi_part_ii.html
https://pyro.ai/examples/svi_part_iii.html

Thanks again, @martinjankowiak ! These links really help. A few questions from them:

Sequential plate:

Let’s return to the example we used in the previous tutorial

def model(data):
    # sample f from the beta prior
    f = pyro.sample("latent_fairness", dist.Beta(alpha0, beta0))
    # loop over the observed data [WE ONLY CHANGE THE NEXT LINE]
    for i in pyro.plate("data_loop", len(data)):
        # observe datapoint i using the bernoulli likelihood
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])

“Subsampling when there are both global and local random variables”:

def model(data):
    beta = pyro.sample("beta", ...) # sample the global RV
    for i in pyro.plate("locals", len(data)):
        z_i = pyro.sample("z_{}".format(i), ...)
        # compute the parameter used to define the observation
        # likelihood using the local random variable
        theta_i = compute_something(z_i)
        pyro.sample("obs_{}".format(i), dist.MyDist(theta_i), obs=data[i])

Note that in contrast to our running coin flip example, here we have pyro.sample statements both inside and outside of the plate loop.

QUESTION 1: But I see pyro.sample() both inside and outside of the plate loop in both examples ?

(more questions to come, splitting them into multiple posts)

not sure how that ended up there. should be removed. please consider submitting a PR to fix : )


Pyro leverages (conditional) independence declared by pyro.plate() in model p and/or guide q in two ways:

A) subsampling (sometimes also called “mini-batching”) the data x_i (and the local random variables z_i) to reduce computation time.

B) “Rao-Blackwellization” (used here to mean variance reduction by conditioning, not the classical construction with a sufficient statistic) to reduce the variance of the Monte Carlo estimates of ELBO gradients, by handling some terms analytically rather than estimating everything by Monte Carlo.

“It is always safe to assume dependence” seems to be about B, not A, because there is no subsample_size in the plate code, and they say it won’t matter for reparameterized variables, where we would not use “Rao-Blackwellization”.

So when would we want to drop plate and be “safe” with our assumption of dependence ?

If mini-batching, then need plate.

If reparameterized variables, then it won’t matter either way.

If non-reparameterizable variables (e.g. discrete variables), then we would want to use plate to reduce variance, as Martin says above. A sketch of that setup (hypothetical model and guide; TraceGraph_ELBO is the estimator that exploits this structure):
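
import torch
import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, TraceGraph_ELBO
from pyro.optim import Adam

data = torch.tensor([0., 1., 1., 0., 1.])

def model(data):
    with pyro.plate("data", len(data)):
        z = pyro.sample("z", dist.Bernoulli(0.5))        # discrete local latent
        pyro.sample("obs", dist.Normal(z, 1.), obs=data)

def guide(data):
    p = pyro.param("p", torch.full((len(data),), 0.5),
                   constraint=constraints.unit_interval)
    with pyro.plate("data", len(data)):
        pyro.sample("z", dist.Bernoulli(p))

# TraceGraph_ELBO uses the plate structure to Rao-Blackwellize the
# score-function gradient terms that the discrete z forces us to use
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=TraceGraph_ELBO())
svi.step(data)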

So I’m still unclear on what this section is recommending.

Thank you !

i would recommend always using plates, as it is more idiomatic. whether that information will be useful to pyro depends on various details including dependency structure, presence of discrete latent variables, etc. if you understand the details of elbo and elbo gradient estimation well enough to know where you can get away without explicit plate annotation, then drop it.

Thanks again, Martin ! Coming back to what this section is trying to say…

“It is always safe to assume dependence” says:

in the first version Pyro must assume they are dependent (even though the normals are in fact conditionally independent).

Do you know what this means ? What is the model ?

You say you’d recommend always using plates, but this section says it is “safe” not to ?

Do you know when we would want to use this first version (no plates, with to_event()) ?

safe in the sense that it’ll be mathematically correct. it may still however perform poorly.

let’s say you do an integral on [0, 1]^1000 with monte carlo. you’ll get an unbiased answer but probably gigantic variance. if you know a priori that your function is a product of the form f(x_1)f(x_2)… you can get a much lower variance estimate by using that information. but if you assumed that were the case and it were not, you would get the wrong answer. same basic principle here.
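
a numerical sketch of this (hypothetical integrand: a product of 2*x_i per coordinate, so the true integral over [0, 1]^D is exactly 1):

import numpy as np

rng = np.random.default_rng(0)
D, S = 50, 10_000  # dimension and number of monte carlo samples

# naive: sample the full D-dimensional product directly
naive = np.prod(2.0 * rng.random((S, D)), axis=1)
print(naive.mean())        # heavy-tailed, enormous variance, wildly off

# exploiting the product structure: estimate each 1-d factor, then multiply
per_dim = (2.0 * rng.random((S, D))).mean(axis=0)
print(np.prod(per_dim))    # close to the true value 1.0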

In the first version:

x = pyro.sample("x", Normal(0, 1).expand([10]).to_event(1))

what is the model being fit ? This statement seems to contradict itself:

first version Pyro must assume they are dependent (even though the normals are in fact conditionally independent).

to_event says treat this as one large latent blob.

What is the distribution of that blob ? Are the components dependent or independent ?

the blob is assumed to be fully dependent

The blob is assumed to be dependent in the model (or guide) parts of the ELBO ? Where does it matter that it was written first as independent normals ? In what sense, if at all, is that the model ? I’m still unclear what model is being fit or what guide is being used.
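
For what it’s worth, a quick check (sketch) suggests the density itself still factorizes, so “dependent” must describe what the inference machinery assumes, not the distribution:

import torch
import pyro.distributions as dist

d = dist.Normal(0., 1.).expand([10]).to_event(1)
x = torch.randn(10)
# the joint log-density is still the sum of 10 independent normal terms
assert torch.allclose(d.log_prob(x), dist.Normal(0., 1.).log_prob(x).sum())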

sorry but i unfortunately haven’t time to answer all your questions. use plates everywhere and you’ll be fine. follow best practices by using the syntax demo’d in the extensive set of tutorials, e.g. this logistic regression tutorial.

if you want to understand the math in detail read all the relevant references, e.g.

[1] Stochastic Variational Inference, Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
[2] Auto-Encoding Variational Bayes, Diederik P Kingma, Max Welling
[3] Automated Variational Inference in Probabilistic Programming, David Wingate, Theo Weber
[4] Black Box Variational Inference, Rajesh Ranganath, Sean Gerrish, David M. Blei
[5] Gradient Estimation Using Stochastic Computation Graphs, John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

Thank you for your help so far ! I am asking on behalf of a few folks who found the documentation confusing. My goal is to understand what it is trying to say and then create a PR to rewrite it in a way that is clear to more people and saves folks like you the time of having to clarify.

in all seriousness i’d ask chatgpt etc. for help; it knows quite a bit about pyro

Based on the tables ChatGPT gave me:

It looks like for x, to_event(1) hurts efficiency and doesn’t help with relaxing assumptions, because the model is the same either way (a product of independent normals).

But for z, to_event(1) tells Pyro to treat the components as one dependent block, so we do indeed get more “safety” there (though whether the guide is mean-field still depends on which autoguide class we pick).

The documentation does not make this clear yet; might it be worth a rewrite ?