Combining/summing hyperpriors in a hierarchical model, issues?

Hello,

I am currently working on a model to estimate clicks based on a combination of ads and audiences. The dependent variable is the number of clicks per ad and audience, while the independent variable is the associated cost per ad and audience. Given the large number of ads and audiences, and limited data points, I am concerned about data sparsity and would like to pool the estimates to shrink them toward the mean. Specifically, I want to have one hyperprior for each audience and one for each ad. These hyperpriors should then be combined in some way to form the actual prior for the ad-audience pair (at least that is my idea).

In the context I am modeling, the coefficient b_{ad, audience} should be strictly positive, and the exponent sat_{ad, audience} should be bounded between 0 and 1, which leaves several choices for how to combine the hyperpriors. Currently, I multiply each hyperprior by 0.5 and sum them, though I could also multiply them, or potentially use a Dirichlet hyperprior to distribute weight across the individual hyperpriors. As for the likelihood, I have not settled on it yet: the outcome is a count (positive and discrete unless I scale it), so I am considering a Poisson likelihood.
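
To make the combination options concrete (\alpha and \beta below are just my shorthand for the ad-level and audience-level hyperparameters, and w for Dirichlet weights), the prior location for a pair-level parameter would be one of:

\mu_{ad, audience} = \tfrac{1}{2}\left(\alpha_{ad} + \beta_{audience}\right) \quad \text{(averaging, as above)}
\mu_{ad, audience} = \alpha_{ad} \cdot \beta_{audience} \quad \text{(multiplying)}
\mu_{ad, audience} = w_{1}\,\alpha_{ad} + w_{2}\,\beta_{audience}, \quad (w_{1}, w_{2}) \sim \text{Dirichlet}(1, 1) \quad \text{(Dirichlet-weighted; the concentration is a placeholder)}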

I am looking for feedback from more experienced Bayesian modelers about any potential issues with this approach. While I plan to check for R-hats, divergences, and energy plots, I’m curious whether there are any structural challenges with this setup that might present problems for an HMC sampler, or any inherent pathologies I might be overlooking. Any intuition on this would be greatly appreciated.

The model is as follows:

y_{ad, audience} = b_{ad, audience} \cdot \text{cost}^{sat_{ad, audience}}_{ad, audience}

Where:

  • y_{ad, audience} is the number of outcomes for a given ad and audience
  • \text{cost}_{ad, audience} is the associated cost for that ad and audience
  • b_{ad, audience} is a parameter for the relationship between the outcomes and cost
  • sat_{ad, audience} is the saturation parameter for each ad and audience

The priors I am using are:

\text{hyperprior}_{coef-ad}, \text{hyperprior}_{coef-audience} \sim \mathcal{N^+}(0, 1)
b_{ad, audience} \sim \mathcal{N^+}\left(\frac{\text{hyperprior}_{coef-ad} + \text{hyperprior}_{coef-audience}}{2}, \sigma \right)
\text{hyperprior}_{sat-ad}, \text{hyperprior}_{sat-audience} \sim \text{Gamma}(2, 1)
\text{hyperprior}_{sat-ad-beta}, \text{hyperprior}_{sat-audience-beta} \sim \text{Gamma}(2, 1)
sat_{ad, audience} \sim \text{Beta}\left( \frac{\text{hyperprior}_{sat-ad} + \text{hyperprior}_{sat-audience}}{2}, \frac{\text{hyperprior}_{sat-ad-beta} + \text{hyperprior}_{sat-audience-beta}}{2} \right)

where \mathcal{N^+} represents a truncated normal.
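
In case it helps to see the structure, here is a rough PyMC sketch of the model as written above, using the averaging scheme and the Poisson likelihood I am considering. The data arrays (ad_idx, audience_idx, cost, clicks) are just placeholders in long format, and the HalfNormal prior on \sigma is only a stand-in since I have not chosen one yet:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Toy long-format data only to make the sketch runnable (placeholders)
n_ads, n_audiences, n_obs = 10, 5, 200
rng = np.random.default_rng(0)
ad_idx = rng.integers(0, n_ads, n_obs)
audience_idx = rng.integers(0, n_audiences, n_obs)
cost = rng.gamma(2.0, 50.0, n_obs)
clicks = rng.poisson(5.0, n_obs)

with pm.Model() as model:
    # Ad- and audience-level hyperpriors for the coefficient location
    # (HalfNormal(1) is the N+(0, 1) above)
    coef_ad = pm.HalfNormal("coef_ad", sigma=1.0, shape=n_ads)
    coef_audience = pm.HalfNormal("coef_audience", sigma=1.0, shape=n_audiences)

    # Hyperpriors for the two Beta shape parameters of the saturation exponent
    sat_ad_a = pm.Gamma("sat_ad_a", alpha=2.0, beta=1.0, shape=n_ads)
    sat_aud_a = pm.Gamma("sat_aud_a", alpha=2.0, beta=1.0, shape=n_audiences)
    sat_ad_b = pm.Gamma("sat_ad_b", alpha=2.0, beta=1.0, shape=n_ads)
    sat_aud_b = pm.Gamma("sat_aud_b", alpha=2.0, beta=1.0, shape=n_audiences)

    # Placeholder prior on the pair-level scale (not specified above)
    sigma = pm.HalfNormal("sigma", sigma=1.0)

    # Pair-level parameters: average the ad and audience hyperpriors
    mu_b = 0.5 * (coef_ad[:, None] + coef_audience[None, :])
    b = pm.TruncatedNormal("b", mu=mu_b, sigma=sigma, lower=0.0,
                           shape=(n_ads, n_audiences))
    alpha_sat = 0.5 * (sat_ad_a[:, None] + sat_aud_a[None, :])
    beta_sat = 0.5 * (sat_ad_b[:, None] + sat_aud_b[None, :])
    sat = pm.Beta("sat", alpha=alpha_sat, beta=beta_sat, shape=(n_ads, n_audiences))

    # Saturating cost curve as the Poisson mean: b * cost ** sat
    cost_t = pt.as_tensor_variable(cost)
    mu_y = b[ad_idx, audience_idx] * cost_t ** sat[ad_idx, audience_idx]
    y = pm.Poisson("clicks", mu=mu_y, observed=clicks)

    # idata = pm.sample()  # then check R-hats, divergences, energy plots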

I have no idea if your priors are good, but one thing I'll note is that real-world data are almost always over-dispersed compared to a Poisson, so you likely want to use a negative binomial likelihood or similar.
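
In PyMC, for example, the swap is roughly one line plus a prior for the dispersion (reusing mu_y and clicks from the sketch in your post; the Gamma prior here is just a placeholder):

# replaces the pm.Poisson line in the model block above
phi = pm.Gamma("phi", alpha=2.0, beta=0.1)  # dispersion; prior is only a placeholder
y = pm.NegativeBinomial("clicks", mu=mu_y, alpha=phi, observed=clicks)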

Have you seen anyone use this type of setup, where different hyperpriors are combined by summing, multiplying, or weighting with a Dirichlet distribution, in a multiple partial pooling context?

I have only found discussion threads where people use a single hyperprior. It feels like the case above would be a fairly standard use case, however.

It also feels like a case where the resulting posterior landscape might have some pathologies, especially in high dimensions.

I haven't seen your particular choices before, but using a low-rank ansatz (in this case, rank one) to parametrize matrices is certainly a common modeling paradigm.
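
For instance, instead of giving every pair its own b_{ad, audience} with a pooled prior, a rank-one parametrization would be something like

b_{ad, audience} = \exp\left(u_{ad} + v_{audience}\right), \qquad u_{ad} \sim \mathcal{N}(0, 1), \quad v_{audience} \sim \mathcal{N}(0, 1)

which keeps the coefficients positive and makes the matrix of b values rank one (u and v are just illustrative notation here, and the standard normal priors are placeholders).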