Questions about Funsors paper

I am reading the funsors paper and got stuck on a couple of questions.

In Figure 1, particularly the Gaussian defining p_{x|c,z}: I don’t see how the value of z (replaced by bold z[c] in the main marginalized expression) influences that distribution, since the parameters of the Gaussian are constants, even though z is supposed to represent the mean of the Gaussian.

It is also unclear to me why c needs to be one of the inputs of that funsor, since z already has shape R^3. It would make more sense to me for it to receive c if the funsor were instead indexed by bold z of shape R^{2×3}, since that would require knowing which component we are talking about.

Still about that Gaussian, how does it “know” x is the dimension for the Gaussian-distributed variable, and not c or z?

Another thing that puzzled me was that a normalized Gaussian is a product of a tensor and a Gaussian funsor. How is the Gaussian funsor not normalized by its own definition?

Hi @rodrigobraz,

Let’s start with Figure 1.

Here p_z is a prior distribution over the means of the mixture components; it is shared by both components. When we want to include that prior in our model we could write, in unvectorized form,

p_z[z\mapsto {\bf z}[0]] \times p_z[z\mapsto {\bf z}[1]]

That’s just a product of two function evaluations, one evaluating the density p_z at the first cluster mean {\bf z}[0], and one evaluating p_z at the second cluster mean {\bf z}[1]. Now to vectorize that expression we can instead write

\prod_c p_z[z\mapsto {\bf z}[c]]

where the expression p_z[z\mapsto {\bf z}[c]] has two free variables: \bf z:\mathbb R^{2\times 3} and c:\mathbb Z_2.
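
Here is a minimal sketch of that substitution in funsor syntax, assuming the same version of the library as the snippets later in this thread; the names p_z and zz (standing for \bf z) and the zero-valued parameters are hypothetical:

import torch
from collections import OrderedDict
from funsor import Variable
from funsor.domains import Bint, Reals
from funsor.gaussian import Gaussian

# A prior density over a single cluster mean z (hypothetical parameters).
p_z = Gaussian(torch.zeros(3), torch.eye(3), OrderedDict(z=Reals[3]))
zz = Variable("zz", Reals[2, 3])  # stands for bold z, both cluster means
c = Variable("c", Bint[2])
term = p_z(z=zz[c])               # p_z[z \mapsto {\bf z}[c]]
assert set(term.inputs) == {"zz", "c"}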

how does p_{x|c,z} “know” x is the dimension for the Gaussian-distributed variable, and not c or z?

The “dimensions” of p_{x|c,z} are named, not positional. This is similar to how variables in Python expressions are named. When you construct a funsor p_{x|c,z} you can examine its free variables, e.g.

import torch
from collections import OrderedDict
from funsor.domains import Bint, Reals
from funsor.gaussian import Gaussian

# Batched over c; jointly Gaussian in the real inputs z and x.
info_vec = torch.zeros(2, 6)
precision = torch.zeros(2, 6, 6) + torch.eye(6)
inputs = OrderedDict([
    ("c", Bint[2]),
    ("z", Reals[3]),
    ("x", Reals[3]),
])
p_x_given_c_z = Gaussian(info_vec, precision, inputs)
assert set(p_x_given_c_z.inputs) == {"c", "z", "x"}

How is the Gaussian funsor not normalized by its own definition?

Explanation 1. In funsor, Gaussians can be rank deficient, allowing us to treat conditional and joint distributions uniformly. Because funsor’s Gaussians can be rank deficient, they are not necessarily normalizable, i.e. their integral may not be finite. But all we want is a canonical form, and since integration cannot serve to canonicalize here, we instead chose to canonicalize by ensuring that each Gaussian evaluates to zero at its mode. When we compile a Normal or MultivariateNormal distribution to a funsor Gaussian, we split it into a canonical Gaussian part that evaluates to zero at its mode, plus a Tensor for the log normalizer. (Note: funsor code works entirely in log space, but the paper uses linear space for clarity.)
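
Concretely, for a univariate Normal the split in log space is

\log \operatorname{Normal}(x;\mu,\sigma) = \underbrace{-\tfrac{(x-\mu)^2}{2\sigma^2}}_{\text{canonical Gaussian, zero at its mode } x=\mu} \underbrace{{}-\tfrac12\log(2\pi\sigma^2)}_{\text{Tensor log normalizer}}

where the first part becomes the canonical Gaussian funsor and the second is the constant stored in a Tensor.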

Explanation 2. Non-normalized distributions are more natural objects than normalized distributions. What are conditioned models? Non-normalized probability distributions. That is, the class of non-normalized distributions is closed under conditioning. Are posterior distributions normalized? No, of course not: posteriors should be scaled by the marginal likelihood of the data. What happens if you drop that log normalizer, the partition function? Then score function gradient estimators would be wrong, or would need to be accounted for out of band. In the funsor library we have diverged from the popular practice of normalizing all distributions, and in return we get differentiability of all distributions: reparametrized continuous distributions are differentiable along the x axis, and discrete distributions are differentiable along the log-density axis.

Thank you for your reply, @fritzo.

Here p_z (…)

Sure, the use of p_z was already very clear to me. When I mentioned the replacement of z by {\bf z}[c], I was referring to its use in p_{x|c,z}.

I am not seeing how named parameters help define that z is the mean of the Gaussian and x is the Gaussian-distributed variable in that Gaussian.

When you say that parameters are named rather than positional, I get the impression you are saying that the specific name x matters and that using, say, y, would not work. However, I realize that explanation is unlikely; looking at the code and documentation, there is no mention of that, and the code does not make any assumptions about the names.

So that is my main stumbling block at this point.

Another particularly puzzling point to me is the role of c in p_{x|c,z}, because x is independent of c given z. So it doesn’t seem to me that it needs to be a parent of x, and c is not substituted when we use p_{x|c,z} in the main expression.

I don’t quite follow Explanation 1 although that is just because I am not familiar enough with the canonical representation of Gaussians.

Explanation 2’s point about normalization makes sense to me, but I don’t get some of the steps in the later parts. You say “what happens if you drop the log normalizer? Then score function gradient estimators would be wrong, or would need to be accounted for out of band”. So, did you drop the log normalizer and find a way of working with the gradients anyway, or did you do something else altogether?

Thanks.

Hi @rodrigobraz,

Let’s focus on p_{x|c,z}. This funsor is a batched Gaussian over two variables, x and z. It is a Gaussian in the sense of functional analysis, in that the log density at any (x,z) pair is given by a quadratic function of (x,z). In funsor we identify “Gaussian” with “positive semidefinite quadratic log densities” (or batches thereof). The funsor p_{x|c,z} is not a normalized Gaussian distribution over (x,z) in the sense of statistics: it is normalized over x for any given z value, but it is neither normalized over the pair (x,z) nor normalized over z for any given x value. But it is still a Gaussian funsor over the pair (x,z), and indeed the funsor library does not distinguish between normalized Gaussian distributions (in the sense of statistics) and conditional Gaussian distributions: they are both viewed simply as log density functions that happen to be positive semidefinite quadratic in all their continuous free variables.
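
As a sketch in the paper’s information-form notation (writing u for the concatenation of x and z, with the parameter arrays batched over c):

\log p_{x|c,z}(x,z) = -\tfrac12\, u^\top \Lambda[c]\, u + u^\top i[c] + \text{const}, \qquad u = (x,z) \in \mathbb R^6,

so i has shape (2, 6) and \Lambda has shape (2, 6, 6); neither x nor z is singled out in this form.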

Re: naming, it is important to note that the funsor p_{x|z,c} does not treat x differently from z: they are both simply continuous-valued inputs to a lazy expression. When we write p_{x|c,z} in funsor, the funsor library does not know which variable is normalized and which is conditioned on. Here’s how we spell this in funsor:

p_x_given_c_z = Gaussian(
    torch.tensor(...),  # data for the information vector
    torch.tensor(...),  # data for the precision matrix
    OrderedDict([
        ("c", Bint[2]),
        ("x", Reals[3]),
        ("z", Reals[3]),
    ]),
)

We could move elements around in info_vec and precision and equivalently create a Gaussian with the order of x, z swapped:

p_x_given_c_z = Gaussian(
    torch.tensor(...),  # data for the information vector
    torch.tensor(...),  # data for the precision matrix
    OrderedDict([
        ("c", Bint[2]),
        ("z", Reals[3]),
        ("x", Reals[3]),
    ]),
)
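
Here is a hedged sketch of that equivalence with concrete (hypothetical) parameter values: permuting the x and z blocks of the information vector and precision matrix yields a Gaussian that agrees pointwise with the original.

import torch
from collections import OrderedDict
from funsor import Tensor
from funsor.domains import Bint, Reals
from funsor.gaussian import Gaussian

perm = torch.cat([torch.arange(3, 6), torch.arange(0, 3)])  # swap the x and z blocks
info_vec = torch.randn(2, 6)
precision = (torch.eye(6) + 0.1 * torch.ones(6, 6)).expand(2, 6, 6)

g1 = Gaussian(info_vec, precision,
              OrderedDict([("c", Bint[2]), ("x", Reals[3]), ("z", Reals[3])]))
g2 = Gaussian(info_vec[:, perm], precision[:, perm][:, :, perm],
              OrderedDict([("c", Bint[2]), ("z", Reals[3]), ("x", Reals[3])]))

# Both represent the same log density function of (c, x, z).
point = dict(x=Tensor(torch.randn(3)), z=Tensor(torch.randn(3)))
assert torch.allclose(g1(**point).data, g2(**point).data)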

Note that the semantics of conditioning in our expression x|c,z is recorded only in the Python variable name p_x_given_c_z and is not passed to the funsor library. We know that the Gaussian funsor is the conditional log density of x given c and z, but the funsor library doesn’t know that; it only knows that we’ve given it a non-normalized quadratic log density over two continuous variables and asked it to perform computations with it.

Does that help?

Thanks, that helps. The stumbling block was expecting some kind of differentiation between z and x.

Now I still have remaining questions about c in that funsor. How does the system determine that the Gaussian is over z,x but not c? Does it do it based on the fact that c is integer-typed? Could c have been real-typed but not be considered one of the Gaussian-distributed variables?

Also, as I mentioned above, I am unclear about the need to have c in p_{x|c,z}. Here is how I read the example:

\sum_c p_c \times p_{x|c,z}[z \mapsto {\bf z}[c], x \mapsto \widehat{\bf x}[j]] is equivalent to

p_{c} [c \mapsto 0] \times p_{x|c,z}[z \mapsto {\bf z}[0], x \mapsto \widehat{\bf x}[j]] + p_{c} [c \mapsto 1] \times p_{x|c,z}[z \mapsto {\bf z}[1], x \mapsto \widehat{\bf x}[j]]

From what I understand, p_{x|c,z}[z \mapsto {\bf z}[0], x \mapsto \widehat{\bf x}[j]] evaluates to a funsor with a single {\mathbb Z}_2 dimension c containing two Gaussians, both of which are defined over {\bf z}[0]. This confuses me because I would expect either a single Gaussian, or two Gaussians each corresponding to one of the mixture components, not two Gaussians both defined over {\bf z}[0]. We also have the analogous situation in the second term for {\bf z}[1].

If I were to write the model myself with my current understanding, I would have written something like:

\sum_c p_c[c \mapsto c] \times p_{x|z}[z \mapsto {\bf z}[c], x \mapsto \widehat{\bf x}[j]],

with a previously defined

p_{x|z} \leftarrow \operatorname{Gaussian}((z:{\mathbb R}^3, x: {\mathbb R}^3), i_x, \Lambda_x).

(The p_c[c \mapsto c] is my attempt at reducing the {\mathbb Z}_2-shaped p_c to a single probability, but I am not sure that works, or whether p_c[c] would make more sense instead, or neither.)

Alternatively, I would try writing a vectorized form as in:

p_c \times p_{x|z}[z \mapsto {\bf z}, x \mapsto \widehat{\bf x}[j]].

Can you identify what I am missing?

Also, another question:

So, I had understood that the funsor definition didn’t give special status to either x or z, and therefore did not define a direction between them, so I don’t understand the asymmetry in that it is normalized over x for any value of z but not the other way around. Can you shed some light on that?

Quickly addressing your last question first: the funsor p_{x|c,z} happens to be normalized over x because of the values of the tensors we pass into the Gaussian. The funsor code does not leverage this purely numerical fact.

Thanks for your patience :slightly_smiling_face:

How does the system determine that the Gaussian is over z,x but not c? Does it do it based on the fact that c is integer-typed?

Correct. Funsor’s Gaussian is a quadratic form jointly in all its real inputs and is batched jointly over the Cartesian product of its bounded integer inputs.
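
A small sketch of that batching rule, with hypothetical parameters: two bounded integer inputs give a 2 × 3 = 6-way batch over their Cartesian product, while x remains the only Gaussian variable.

import torch
from collections import OrderedDict
from funsor.domains import Bint, Reals
from funsor.gaussian import Gaussian

g = Gaussian(
    torch.zeros(2, 3, 4),             # info_vec, batched over (c, d)
    torch.eye(4).expand(2, 3, 4, 4),  # precision, batched over (c, d)
    OrderedDict([("c", Bint[2]), ("d", Bint[3]), ("x", Reals[4])]),
)
assert g.inputs["x"] == Reals[4]  # the only real (Gaussian) input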

Could c have been real-typed but not be considered one of the Gaussian-distributed variables?

Nope, all of Funsor’s Gaussian inputs are either Bint batch variables or Real/Reals[...] Gaussian variables.

From what I understand, p_{x|c,z}[z \mapsto {\bf z}[0], x \mapsto \widehat{\bf x}[j]] evaluates to a funsor with a single {\mathbb Z}_2 dimension c containing two Gaussians, both of which are defined over {\bf z}[0].

It’s a little more precise to say that it evaluates to a batched Gaussian with batch index c:Bint[2] and continuous variable {\bf z}[0]. You can think of this as part of a Gaussian mixture model over {\bf z}[0], but where the mixture weights are stored elsewhere (in another Tensor with input c).
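
For example (with hypothetical weights), those mixture weights would live in a separate funsor Tensor whose only input is c:

import torch
from collections import OrderedDict
from funsor import Tensor
from funsor.domains import Bint

log_weights = Tensor(torch.tensor([0.3, 0.7]).log(), OrderedDict(c=Bint[2]))
assert set(log_weights.inputs) == {"c"}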

I would have written something like \sum_c p_c[c \mapsto c] \times p_{x|z}[z \mapsto {\bf z}[c], x \mapsto \widehat{\bf x}[j]]

Well, that makes it look like p_{x|z} doesn’t depend on c. That looks like a mixture model whose two mixture components are identical.

It might help to think about the underlying numerical arrays. If you want to represent a Gaussian you’ll need something like a mean vector of total length sum(x.numel() for x in vectors) and something like a square covariance matrix of that side length. (Note the funsor paper uses an information vector and precision matrix, and the latest funsor uses a square-root parameterization, but in all cases you need a vector and a matrix.) Now to represent a mixture model, you’ll need to batch that vector and matrix over the number of mixture components. Now when I write p_{x|c,z} I’m thinking I’ll need a batched matrix-vector pair, where the vector has length x.numel() + z.numel() = 3 + 3 = 6 and batch size 2, so the vector has shape (2, 6) and the matrix has shape (2, 6, 6). If you drop that c index, then the underlying data wouldn’t be batched.
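
A quick sketch of that shape arithmetic, with hypothetical zero-valued parameters:

import torch
from collections import OrderedDict
from funsor.domains import Bint, Reals
from funsor.gaussian import Gaussian

# With c: parameters batched over the two mixture components.
batched = Gaussian(torch.zeros(2, 6), torch.eye(6).expand(2, 6, 6),
                   OrderedDict([("c", Bint[2]), ("z", Reals[3]), ("x", Reals[3])]))
# Without c: a single unbatched Gaussian, with shapes (6,) and (6, 6).
single = Gaussian(torch.zeros(6), torch.eye(6),
                  OrderedDict([("z", Reals[3]), ("x", Reals[3])]))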

Sure, thank you for your reply. :slightly_smiling_face:

Should I have gotten those facts about Gaussians from the paper? It doesn’t look like those things were stated.

As far as I can tell your description is equivalent to mine: essentially, we are representing two Gaussians, both of which have mean {\bf z}[0].

And this is not what we want. Recall that p_{x|c,z}[z \mapsto {\bf z}[0], x \mapsto \widehat{\bf x}[j]] is just one of the terms of the expansion of \sum_c in the last line of Figure 1. It corresponds to the single component with mean {\bf z}[0]. So it should be about a single Gaussian, not two.

Yes, like you say, it is part of the Gaussian mixture model, namely the part regarding the first component. So at this point the weights of the components no longer need to be considered.

Yes, that is the point I have been making: in this formulation, p_{x|z} is not the mixture model; it is the dependence of x on the chosen component (which has been provided as {\bf z}[c]). Note we are writing z, which is the mean of one component, not {\bf z}.

Figure 1, unlike my formulation immediately above, displays p_{x|z,c}, not p_{x|z}, but regardless of that it uses z, not {\bf z}, so it cannot be the entire mixture model. The mixture model should include {\bf z}.

Hopefully I’ve clarified that it is not the mixture model, but just the dependence of x on a single component. And this is true not only in my own alternative encoding, but also in Figure 1 because it uses z instead of {\bf z}.

True, if that line were representing the full mixture model, I would agree. However, that does not seem to be the case because of the use of z, as I mentioned above.

Even if we assume this is a typo (which I don’t think it is) and say that it should have been {\bf z} instead of z, there is still a problem, because then the parameter {\bf z} would have shape (2, 3) but in the last line we are passing the 3-vector {\bf z}[c] to it.

Maybe it would help to clarify: this reading is incorrect in the second expression, because p_{x|c,z} does depend explicitly and directly on c (as the argument c:\mathbb{Z}_2 passed to Gaussian when p_{x|c,z} is created indicates, and as the caption of the figure says, the two parameter arrays are batched over c).

The correct, verbose version of one summand in your first expression would be

p_{c} [c \mapsto 0] \times p_{x|c,z}[z \mapsto ({\bf z}[c][c \mapsto 0]), x \mapsto \widehat{\bf x}[j]][c \mapsto 0]

where there is now an explicit substitution of a single value of c into each subterm depending on c, including directly into p_{x|c,z} which selects the correct subset of parameters i_x, \Lambda_x for the first mixture component. Note that this is syntactically equivalent to the more compact expression

( p_c \times p_{x|c,z}[z \mapsto {\bf z}[c], x \mapsto \widehat{{\bf x}}[j]])[c \mapsto 0]

which should make it clear that the result cannot still contain the free variable c and hence must be a single (weighted) Gaussian, replicated for each datapoint j.
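
Here is a hedged, self-contained sketch of that compact expression in funsor syntax (all parameter values and the names zz and x_obs are hypothetical; funsor works in log space, so the product becomes addition):

import torch
from collections import OrderedDict
from funsor import Tensor, Variable
from funsor.domains import Bint, Reals
from funsor.gaussian import Gaussian

p_c = Tensor(torch.tensor([0.5, 0.5]).log(), OrderedDict(c=Bint[2]))
p_x_given_c_z = Gaussian(
    torch.zeros(2, 6), torch.eye(6).expand(2, 6, 6),
    OrderedDict([("c", Bint[2]), ("z", Reals[3]), ("x", Reals[3])]),
)
zz = Variable("zz", Reals[2, 3])  # stands for bold z
c = Variable("c", Bint[2])
x_obs = Tensor(torch.randn(3))    # stands in for a single observation \widehat{\bf x}[j]

# (p_c \times p_{x|c,z}[z \mapsto {\bf z}[c], x \mapsto \widehat{\bf x}[j]])[c \mapsto 0]:
term = (p_c + p_x_given_c_z(z=zz[c], x=x_obs))(c=0)
assert "c" not in term.inputs  # a single weighted Gaussian over zz remains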

Finally, note that this explanation does not contradict your correct understanding that the substituted term is a batched Gaussian with batch index c:\mathbb{Z}_2. Rather, you just missed an extra substitution of c \mapsto 0 that would make this the correct expression for the summand.

I feel like this LaTeX notation isn’t getting us closer to a shared mental model, and indeed one of the paper reviewers suggested we should use the library’s real syntax instead. Would it help to switch to actual funsor syntax? Here’s a Colab notebook with Figure 1. @rodrigobraz and @eb8680_2, you should both be able to edit.

Thanks, @fritzo. Just wanted to let you know that I read your reply and, after thinking for a long time, I think I finally got it. I have some thoughts to write about, which I will probably only get to next week. I haven’t had time to look into the notebook in detail yet, but will do so. Thanks!


Hello again. My apologies for taking so long to respond. I did take a good look at the Colab notebook and I believe I finally got the meaning behind the encoding. My current view is that it makes sense and is useful, but somehow the notation makes it very hard to understand. I have not thought long enough about it to put my finger on why that is, or whether I have a better proposal. I would have expected the same things to be achievable with a more straightforward, standard, and therefore more intuitive mathematical notation, but I am not quite sure what would need to change. In my own work I believe I have managed to make things simpler, but my framework does not currently seem to do as much as funsor, so it may be that I simply have not gotten to the hard parts yet.