How to avoid a large number of unused random variables in hierarchical regression with 3 levels

I would like to do hierarchical regression with one more level than in this great NumPyro tutorial: Bayesian Hierarchical Linear Regression — NumPyro documentation.
[My only complaint about the example is that it uses very unergonomic Greek letters.]

Basically, how would I extend the model if I wanted to add a hierarchical layer like Smoker/Non-Smoker/Ex-Smoker in between the entire group and the individual patient?

I know this is a contrived example, not something one would do in this particular case, but I thought extending a canonical example would be more straightforward than coming up with my own.

My actual use case involves multi-level geographical hierarchies of the form: state > county > individual.

I’ve tried extending the model like this, but it’s not very pretty. That’s not surprising, I guess, since it introduces about a thousand random variables that no data constrains: only a third of the [PatientID, Smoking] combinations exist in the data.

#%%
import numpy as np
import numpyro
import numpyro.distributions as dist

def model(PatientID, Weeks, Smoking, FVC_obs=None):
    mu_a_mu = numpyro.sample("mu_a_mu", dist.Normal(0.0, 100))
    mu_a_sigma = numpyro.sample("mu_a_sigma", dist.HalfNormal(100))
    sigma_a_sigma = numpyro.sample("sigma_a_sigma", dist.HalfNormal(100))
    mu_b_mu = numpyro.sample("mu_b_mu", dist.Normal(0.0, 100))
    mu_b_sigma = numpyro.sample("mu_b_sigma", dist.HalfNormal(100))
    sigma_b_sigma = numpyro.sample("sigma_b_sigma", dist.HalfNormal(100))

    unique_smoking_IDs = np.unique(Smoking)
    n_smoking_groups = len(unique_smoking_IDs)

    unique_patient_IDs = np.unique(PatientID)
    n_patients = len(unique_patient_IDs)

    # n_patients_per_group = pd.DataFrame({"Smoking": Smoking, "PatientID": PatientID}).groupby("Smoking").PatientID.nunique().max()

    with numpyro.plate("plate_i", n_smoking_groups):
        mu_a = numpyro.sample("mu_a", dist.Normal(mu_a_mu, mu_a_sigma))
        sigma_a = numpyro.sample("sigma_a", dist.HalfNormal(sigma_a_sigma))
        mu_b = numpyro.sample("mu_b", dist.Normal(mu_b_mu, mu_b_sigma))
        sigma_b = numpyro.sample("sigma_b", dist.HalfNormal(sigma_b_sigma))
        with numpyro.plate("plate_j", n_patients):
            a = numpyro.sample("a", dist.Normal(mu_a, sigma_a))
            b = numpyro.sample("b", dist.Normal(mu_b, sigma_b))

    sigma = numpyro.sample("sigma", dist.HalfNormal(100.0))
    FVC_est = a[PatientID, Smoking] + b[PatientID, Smoking] * Weeks

    with numpyro.plate("data", len(PatientID)):
        numpyro.sample("obs", dist.Normal(FVC_est, sigma), obs=FVC_obs)

#%%
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # fit_transform refits each time, so one encoder can be reused across columns
train["PatientID"] = le.fit_transform(train["Patient"].values)
train["SmokingID"] = le.fit_transform(train["SmokingStatus"].values)

FVC_obs = train["FVC"].values
Weeks = train["Weeks"].values
PatientID = train["PatientID"].values
SmokingID = train["SmokingID"].values

Is there a better way to do this? At the very least, I could make the second plate only as wide as the largest number of unique patients in any single smoking group.

Even better would be to make the inner plate only as large as it needs to be: the number of unique patients within each particular group. Is this possible?

I managed to get it to work. Originally it failed to converge because I had ordered the indices wrong in
FVC_est = a[PatientID, Smoking] + b[PatientID, Smoking] * Weeks
The first (outermost) plate always corresponds to the rightmost dimension, and each nested plate adds a dimension to its left.
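
To convince myself of the ordering, a toy trace makes the shapes explicit (a minimal sketch, independent of the actual data):

import numpyro
import numpyro.distributions as dist
from numpyro.handlers import seed

def toy(n_groups=3, n_patients=5):
    with numpyro.plate("plate_i", n_groups):        # entered first -> rightmost dim
        mu = numpyro.sample("mu", dist.Normal(0.0, 1.0))
        with numpyro.plate("plate_j", n_patients):  # entered second -> one dim to the left
            a = numpyro.sample("a", dist.Normal(mu, 1.0))
    return a

with seed(rng_seed=0):
    print(toy().shape)  # (5, 3): plate_j indexes rows, plate_i indexes columns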

I reduced the number of superfluous random variables somewhat (but not by much) using this custom label encoding for PatientID (PatientID is now only unique in combination with SmokingID):

#%%
train["SmokingID"] = le.fit_transform(train["SmokingStatus"].values)
#%%
# Fit one encoder per smoking group so patient IDs restart at 0 within each group
n_smoking_ids = train.SmokingID.max() + 1
encoders = [LabelEncoder() for _ in range(n_smoking_ids)]
for i in range(n_smoking_ids):
    encoders[i] = encoders[i].fit(train[train.SmokingID == i].Patient)

PatientID = []
for row in train.itertuples():
    encoder = encoders[row.SmokingID]
    PatientID.append(encoder.transform([row.Patient])[0])
train["PatientID"] = PatientID
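
For reference, the same per-group encoding can be written more compactly with a pandas groupby (a sketch assuming the same train columns):

# Factorize Patient separately within each SmokingID; transform aligns results back by index
train["PatientID"] = train.groupby("SmokingID")["Patient"].transform(
    lambda s: LabelEncoder().fit_transform(s)
)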

This is the model; I reduced the number of variance hyperparameters, since two levels of them seemed overkill.

This now converges and iterates at roughly 50 it/s on my 2018 MacBook Pro, which is not too bad.

def model(PatientID, Weeks, Smoking, FVC_obs=None):
    mu_a_mu = numpyro.sample("mu_a_mu", dist.Normal(2000, 1000))
    mu_b_mu = numpyro.sample("mu_b_mu", dist.Normal(0.0, 100))
    sigma_a = numpyro.sample("sigma_a", dist.HalfNormal(1000))
    sigma_b = numpyro.sample("sigma_b", dist.HalfNormal(10))

    unique_smoking_IDs = np.unique(Smoking)
    n_smoking_groups = len(unique_smoking_IDs)

    unique_patient_IDs = np.unique(PatientID)
    n_patients = len(unique_patient_IDs)

    with numpyro.plate("plate_i", n_smoking_groups):
        mu_a = numpyro.sample("mu_a", dist.Normal(mu_a_mu, 1000))
        mu_b = numpyro.sample("mu_b", dist.Normal(mu_b_mu, 100))
        with numpyro.plate("plate_j", n_patients):
            a = numpyro.sample("a", dist.Normal(mu_a, sigma_a))
            b = numpyro.sample("b", dist.Normal(mu_b, sigma_b))

    sigma = numpyro.sample("sigma", dist.HalfNormal(100.0))
    FVC_est = a[PatientID, Smoking] + b[PatientID, Smoking] * Weeks

    with numpyro.plate("data", len(PatientID)):
        numpyro.sample("obs", dist.Normal(FVC_est, sigma), obs=FVC_obs)

If anyone can think of a more elegant way to set up the plates, let me know. I’m not very happy with the fact that I have 600 random variables flying around that serve absolutely no purpose at all.

600 random variables flying around that serve absolutely no purpose at all

Could you elaborate on why this happens in your model?

For modeling, I would recommend reparameterizing your model (like this one).

600 random variables flying around that serve absolutely no purpose at all

Could you elaborate on why this happens in your model?

Thanks, yes of course!

The hierarchy is Smoker > PatientID.

There are 3 smoker types, but they have unequal numbers of patients: roughly 100 non-smokers, 50 ex-smokers, and 10 current smokers.

In the current implementation, the inner plate needs to be as wide as the largest smoker type. As a result, latent variables are sampled for 50 ex-smoker and 90 current-smoker patients that don’t exist at all.

Does that explanation make sense?

In the first post the inner plate was 170 wide, which was even more wasteful than in the second post, where it is now 100 wide: the size of the largest group.

For modeling, I would recommend reparameterizing your model (like this one).

Do you mean using non-centred parametrization? I don’t yet know how to do this and will try to figure it out, but it’s not the core issue in this thread, unless I misunderstood you.
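
If I understand correctly, a manual non-centred version would replace a centred sample like a ~ Normal(mu_a, sigma_a) with something along these lines (a sketch, possibly naive):

import numpyro
import numpyro.distributions as dist

def noncentred_fragment(mu_a, sigma_a, n_patients):
    with numpyro.plate("plate_j", n_patients):
        # Sample a standard normal and shift/scale it deterministically;
        # the sampler then explores the well-conditioned standard-normal
        # geometry instead of the funnel between a and sigma_a.
        a_raw = numpyro.sample("a_raw", dist.Normal(0.0, 1.0))
        a = numpyro.deterministic("a", mu_a + sigma_a * a_raw)
    return a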

I see. I guess you can do

with numpyro.plate("plate_i", n_smoking_groups):
    mu_a = ...
    mu_b = ...
with numpyro.plate("data", len(PatientID)):
    a = numpyro.sample("a", dist.Normal(mu_a[Smoking], ...))
    b = ...
    FVC_est = ...
    numpyro.sample("obs", dist.Normal(FVC_est, sigma), obs=FVC_obs)
# here patient_smoking_group is the smoking group of each patient

Thanks! Your suggestion helped me figure it out. It’s a bit trickier than that, since what you proposed would not use the entire hierarchy as desired: the second plate should only range over the unique PatientIDs and hence must not contain the observation samples.

The result looks something like this; first I need to create the mapping from PatientID to SmokingGroup:

SmokingGroup_per_PatientID = (
    pd.DataFrame({"Smoking": Smoking, "PatientID": PatientID})
    .groupby("PatientID")
    .Smoking.first()
    .values
)

with numpyro.plate("plate_i", n_smoking_groups):
    mu_a = ...
    mu_b = ...
with numpyro.plate("plate_j", n_unique_patientIDs):
    a = numpyro.sample("a", dist.Normal(mu_a[SmokingGroup_per_PatientID], ...))
    b = ...

FVC_est = a[PatientID] + b[PatientID] * Weeks
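
Putting everything together, the full model now looks roughly like this (a sketch; priors as in my second model above, and PatientID encoded as globally unique again so that it indexes a directly):

def model(PatientID, Weeks, SmokingGroup_per_PatientID, FVC_obs=None):
    mu_a_mu = numpyro.sample("mu_a_mu", dist.Normal(2000.0, 1000.0))
    mu_b_mu = numpyro.sample("mu_b_mu", dist.Normal(0.0, 100.0))
    sigma_a = numpyro.sample("sigma_a", dist.HalfNormal(1000.0))
    sigma_b = numpyro.sample("sigma_b", dist.HalfNormal(10.0))

    n_smoking_groups = len(np.unique(SmokingGroup_per_PatientID))
    n_patients = len(SmokingGroup_per_PatientID)

    # One (mu_a, mu_b) per smoking group
    with numpyro.plate("plate_i", n_smoking_groups):
        mu_a = numpyro.sample("mu_a", dist.Normal(mu_a_mu, 1000.0))
        mu_b = numpyro.sample("mu_b", dist.Normal(mu_b_mu, 100.0))

    # One (a, b) per *existing* patient; each patient looks up its group's
    # mu, so no phantom patient-group combinations are sampled.
    with numpyro.plate("plate_j", n_patients):
        a = numpyro.sample("a", dist.Normal(mu_a[SmokingGroup_per_PatientID], sigma_a))
        b = numpyro.sample("b", dist.Normal(mu_b[SmokingGroup_per_PatientID], sigma_b))

    sigma = numpyro.sample("sigma", dist.HalfNormal(100.0))
    FVC_est = a[PatientID] + b[PatientID] * Weeks

    with numpyro.plate("data", len(PatientID)):
        numpyro.sample("obs", dist.Normal(FVC_est, sigma), obs=FVC_obs)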

Thanks for your help @fehiepsi, I’ve managed to get it to work and this is the result in case you’re curious. It’s a hierarchical model of Omicron share in Germany.

Could this be interesting as an example of more than 2-level regression?

Yeah, I think so. As above, I think it is better to reparameterize the model, even if it already gives good results. You can simply add three lines to your notebook:

from numpyro.infer.reparam import LocScaleReparam

reparam_config = {k: LocScaleReparam(0) for k in ["mu_a", "a"]}

@numpyro.handlers.reparam(config=reparam_config)
def model(...):

Thanks for the tip!

The thing that stops me from reparametrizing is that I don’t know how to invert the reparametrization. I’m interested in the values of the unreparametrized variables mu_a and a. How can I get those? I couldn’t find it anywhere (yet). It would be great to add this.

I think they will appear in the trace by default. Do you see them when you print the summary or call the get_samples method?

Ah, genius! They don’t show in mcmc.print_summary() by default (is that a bug?), but I can get them to show as follows:

numpyro.diagnostics.summary(mcmc.get_samples(), group_by_chain=False)

Ah, sorry, it is not shown by default. You will need to set exclude_deterministic to False; see Markov Chain Monte Carlo (MCMC) — NumPyro documentation. :)
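
Concretely, that is:

mcmc.print_summary(exclude_deterministic=False)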


Does automatic reparametrization not work with discrete latent variables?

I get a ValueError about incompatible shapes for broadcasting:

import numpy as np
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS
from numpyro.infer.reparam import LocScaleReparam

reparam_config = {k: LocScaleReparam(0) for k in ["a"]}

# Reparametrization doesn't seem to work with a discrete latent variable
@numpyro.handlers.reparam(config=reparam_config)
def model(sending_pc, sample_day, total_counts, omi_counts=None):
    mu_a = numpyro.sample("mu_a", dist.Normal(0, 5.0))
    sigma_a = numpyro.sample("sigma_a", dist.HalfNormal(3.0))
    b = numpyro.sample("b", dist.Normal(0.2, 0.2))
    outlier_p = numpyro.sample("outlier_p", dist.Beta(1, 20))
    outlier_sigma = numpyro.sample("outlier_sigma", dist.TruncatedNormal(0.1, 1, 1))

    n_labs = len(np.unique(sending_pc))

    with numpyro.plate("plate_j", n_labs):
        outlier = numpyro.sample("outlier", dist.Bernoulli(outlier_p))

        a = numpyro.sample("a", dist.Normal(mu_a, sigma_a * (1+outlier_sigma * outlier)))

    logit_est = a[sending_pc] + b * sample_day

    with numpyro.plate("data", len(sending_pc)):
        numpyro.sample("obs", dist.BinomialLogits(logits=logit_est, total_count=total_counts), obs=omi_counts)

nuts_kernel = NUTS(model)

mcmc = MCMC(nuts_kernel, num_samples=5000, num_warmup=2000)
rng_key = random.PRNGKey(0)
mcmc.run(rng_key, sending_pc_ID, sample_day, total_counts, omi_counts=omi_counts)

# ValueError: Incompatible shapes for broadcasting: ((4770, 90), (1, 4770))

I guess it is a bug. Could you help me create a GitHub issue for this? Thanks!


@fehiepsi I opened an issue with fully reproducible code in a Gist.


Thanks, @corneliusroemer!