I’m working on an inverse reinforcement learning problem where the agent’s utility is a linear function of a hidden variable x \sim N(\mu, \sigma), with utility function U(a_t, x) = x - a_t. In addition, the agent selects an action from the continuous interval [min, max] according to a softmax policy \pi = \frac{e^{U(x)}}{\int_{min}^{max}e^{U(y)}dy}.

I want to sample x conditioned on the action a_{t-1}, so my model is:

`x = numpyro.sample("utility", dist.Normal(loc=0.0, scale=1.0))`

`x = x * s + m`

`actions = jnp.linspace(min, max)`

`utility = softmax(x - actions) / integral`

`observation = numpyro.sample("counter_offer", dist.Categorical(probs=utility), obs=observation)`

I’m not sure the model is right, though. That is, I’m not sure it is actually conditioned on the observation, since the last line is not well defined in the case of a single observed value (right?).

I’m looking for a solution to the problem of conditioning an action selection given a range of possible values. Thanks!

Hi @nitalon, I’m a bit confused by your notation. I would expect the normalizing constant of your Gibbs policy to be an integral over actions that depends on the value of `x`, but your equation and your code snippets both seem to indicate that this is not the case.
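
To make the dependence on `x` concrete, here is a short worked derivation. It assumes the policy is meant to be the softmax of U(a, x) = x - a over actions a in [min, max], which is one reading of the question rather than something it states outright; Z(x) is a name introduced here for the normalizing constant.

```latex
Z(x) = \int_{min}^{max} e^{x - a}\, da = e^{x}\left(e^{-min} - e^{-max}\right),
\qquad
\log \pi(a \mid x) = (x - a) - \log Z(x) = -a - \log\left(e^{-min} - e^{-max}\right)
```

Under that reading Z does depend on x, but because x enters the utility additively it cancels in the log-policy, so an observed action would carry no information about x. A utility in which x interacts with the action would not cancel this way.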

Aside from that, in general the discretized approach you seem to be taking should be fine (i.e. consistent as the number of discrete values goes to infinity) as long as you also discretize the observed action (`observation`). Alternatively, you could use `numpyro.factor` directly with the continuous `observation` value since `integral` is analytically tractable:

```
numpyro.factor("observation", x - observation - jnp.log(integral))
```

Thanks!