MultivariateNormal sampling with only some of the dimensions observed

8one6 · June 17, 2024, 9:10pm

Let’s say that I manage 6 bakeries. Each has its own menu (with no menu items in common across locations), and the number of items can vary from bakery to bakery. For example, bakery number 3 might offer 10 items, bakery number 4 might offer 15.

Every now and then, I ask a customer to rate a number of items that I offer them to try from a single location. Maybe one time a customer rates 3 of the 10 items at one location, and another time a customer rates 8 of the 11 items at another location. My customers are delightfully strange and give z-score ratings to each thing they try.

I believe that the right model here would be hierarchical with respect to the “true tastiness” of each recipe:

There should be some “global” tastiness for which I can specify a prior (single parameter)
There should be some “per-bakery tastiness bump” for which I can specify a prior (plate of parameters whose size is equal to the number of bakeries)
There should be some “per-item tastiness bump” for which I can specify a prior (plate of parameters whose size is equal to the total number of items)

With proper indexing, these three components of tastiness should be added together (deterministically) to get my estimate of the “all-in” tastiness of a given recipe.

If I wanted to treat every individual rating that a customer gave as independent, I think I know how to model this. In particular, say score is an array containing all the ratings I’ve gotten, item_id is an array of the same length that indicates what item was rated, and store_id is an array of the same length that indicates what store the item came from. In particular, assume item_id and store_id run from 0 up to the values they need to for the data. Then I think you’d write this as:

def bakeries_independent(score, item_id, store_id):
    tastiness_global = numpyro.sample('tastiness_global', dist.Normal(0, 1))
    scale_dispersion_item = numpyro.sample('scale_dispersion_item', dist.HalfNormal(0.5))
    scale_dispersion_store = numpyro.sample('scale_dispersion_store', dist.HalfNormal(0.5))
    scale_noise = numpyro.sample('scale_noise', dist.HalfNormal(1.0)

    with numpyro.plate('items', max(item_id) + 1):
        tastiness_item_bump = numpyro.sample('tastiness_item_bump', dist.Normal(0, scale_dispersion_item))

    with numpyro.plate('stores', max(store_id) + 1):
        tastiness_store_bump = numpyro.sample('tastiness_store_bump', dist.Normal(0, scale_dispersion_store))

    tastiness = numpyro.deterministic('tastiness', tastiness_global + tastiness_item_bump[item_id] + tastiness_store_bump[store_id])
   
    with numpyro.plate('scores', len(score)):
        numpyro.sample(dist.Normal(tastiness, scale_noise), obs=score)

Because I am treating every rating as independent of every other rating, I don’t need to pay attention to which ratings came from which customers. Lovely.

The problem is, I don’t think the ratings that a single customer provides are independent across the items they try. There are “hard graders” and “easy graders” (and this would lead to positive correlation among the ratings given by a single customer across all the items they try). And there are more- and less-similar items on the menu. If a customer loves our brownies-without-walnuts, they probably love our brownies-with-walnuts. But knowing their score on brownies-without-walnuts tells me less about their score on our croissants.

I’m happy to say that the “collection” of scores given by any given customer can be modelled as a dist.MutivariateNormal distribution. But I’m having trouble figuring out how to set that up appropriately in numpyro.

In particular, I’m struggling with questions of how to set up appropriate plates, given the varying sizes of various “ideas”:

In the stores plate, I believe I need to introduce a numpyro.sample statement that brings in a dist.LKJ distributed variable that represents the correlation matrix for each individual store. The dimension of this variable should match up with the number of distinct recipes offered by that store. However, I’m struggling because the number of recipes varies from store to store. Even if I had an argument in the model signature called distinct_recipes_per_store that contained that information, I’m not sure how to set up a plate so that I get, say, a 10-dimensional distribution for bakery number 3 but a 15-dimensional distribution for bakery number 4. I suppose I could set up the correlation matrix for each store to be as large as it needs to be for the store with the most distinct items, but that seems very inefficient.
I’m a bit at a loss for how to set up the numpyro.sample statement that will actually bring in the customer ratings as observations. Even if all my bakeries had the same number of recipes (and, again, they don’t), the number of items rated by each customer varies from one “rating collection” to the next. So at store 3 (which has 10 distinct items), one customer might score items numbers 1, 2 and 4, while another customer might score items 1, 2, 4, and 7, and another still might sore items 1, 5, 7, 8, and 9. The first customer’s ratings will be understood to come from a 3-dimensional distribution whose correlation matrix is a subset of the full 10-dimensional correlation matrix for that store; the second customer’s data will come from a 4-dimensional distribution whose correlation matrix is a subset of the full 10-dimensional correlation matrix from that store; and the third customer’s data will come from a 5-dimensional distribution whose correlation matrix is a subset of the full 10-dimensional correlation matrix from that store.

What’s the right way to handle this?