Hi @smp,
I have been using both Pyro and NumPyro for fitting computational models of behavior to experimental data. Note that you do not need to sample actions to estimate the parameters, as the actions are only used to compute the model likelihood. In your case both the actions and the outcomes of those actions are fixed by the behavior of the participants, hence it is sufficient to have a mapping from outcomes to action probabilities (logits).
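To make the point concrete, here is a minimal sketch in plain Python (no Pyro or NumPyro calls) of what "a mapping from outcomes to logits" can look like. The Q-learning update, the parameter names `lr` and `beta`, and the toy data are all assumptions for illustration, not something taken from your model. The recorded actions are only scored, never sampled.

```python
import math


def softmax_log_probs(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def log_likelihood(actions, outcomes, lr, beta, n_actions=2):
    """Log-probability of the recorded actions under an assumed Q-learning model.

    Both `actions` and `outcomes` are the participant's data; nothing is sampled.
    """
    q = [0.0] * n_actions  # action values before the first trial
    total = 0.0
    for a, r in zip(actions, outcomes):
        logits = [beta * v for v in q]          # past outcomes -> logits
        total += softmax_log_probs(logits)[a]   # score the observed action
        q[a] += lr * (r - q[a])                 # update with the observed outcome
    return total


# Hypothetical data: four trials of a two-armed bandit.
actions = [0, 1, 1, 0]
outcomes = [1.0, 0.0, 1.0, 1.0]
print(log_likelihood(actions, outcomes, lr=0.3, beta=2.0))
```

In Pyro or NumPyro the same per-trial logits would go into a `Categorical(logits=...)` sample statement with `obs=` set to the recorded actions, and the priors over `lr` and `beta` would be ordinary sample statements.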
For sampling actions from a model you can make a separate Distribution (e.g. BehaviouralDistribution) and define the sampling process either inside its sample method or inside another special method which returns both actions and outcomes.
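A rough sketch of that idea, again in plain Python: the class below is a hypothetical stand-in and does not subclass the Pyro or NumPyro `Distribution` base class. The Q-learning agent, the `bandit` environment and its reward probabilities are assumptions chosen only to keep the example small. The point is that the environment is only touched inside `sample`, which returns both actions and outcomes.

```python
import math
import random


class BehaviouralDistribution:
    """Hypothetical simulator for an assumed Q-learning agent."""

    def __init__(self, lr, beta, n_actions=2):
        self.lr = lr
        self.beta = beta
        self.n_actions = n_actions

    def sample(self, environment, n_trials, rng):
        """Simulate `n_trials` trials and return (actions, outcomes)."""
        q = [0.0] * self.n_actions
        actions, outcomes = [], []
        for _ in range(n_trials):
            # Unnormalised softmax weights; rng.choices normalises them.
            weights = [math.exp(self.beta * v) for v in q]
            a = rng.choices(range(self.n_actions), weights=weights)[0]
            r = environment(a, rng)       # the environment runs only here
            q[a] += self.lr * (r - q[a])  # learn from the simulated outcome
            actions.append(a)
            outcomes.append(r)
        return actions, outcomes


def bandit(action, rng, reward_probs=(0.2, 0.8)):
    """Hypothetical two-armed bandit with arm-specific reward probabilities."""
    return 1.0 if rng.random() < reward_probs[action] else 0.0


rng = random.Random(0)
actions, outcomes = BehaviouralDistribution(lr=0.3, beta=2.0).sample(bandit, 50, rng)
print(len(actions), sum(outcomes))
```

Simulated data from such a method can then be fed back into the inference model as fixed observations, which is a convenient way to check parameter recovery.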
I do not see an advantage in running the environment inside the generative model you are using for parameter estimation. It will just slow down the inference.