Conditioning on available samples of minibatch

Hi everyone,

I would like to develop a time series model where the output at each timesteps output_t depends on some input_t (input at that same timestep) and the output of previous timestep, output_{t-1}.

Both in training and in inference I will have some observations of output at different timesteps.

Let’s imagine a dataset of N timeseries of lenght T_max = 6 where I have observed the :

1_output = [None,None,obs_3,None,None,obs_6]
2_output = [obs_1,obs_2,None,None,None,obs_6]
N_output = [None,obs_2,obs_3,None,None,None]

So for the first timeseries I have observed the output at time 3 and 6.
For the second timeserie I have observed the output at time 1,2 and 6.

And aggregating all the output tensors in a batch i.e batch.shape = [N,T_max]

My question is, how do I condition the sampling to only the observations I have?
Looking at the deep markov model tutorial I thought something like:

for t in range(batch.size(1)):
     [some code]
     out_t = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out, scale_out), obs=batch[:, t ])

But how can I specify to condition only on the observation I have?
I mean can I somehow mask the observations?
I guess a possible solution might be:

for t in range(batch.size(1)):
    for sample in range(batch.size(0)):
       out_t[sample] = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out[sample],scale_out[sample]),obs=batch[sample, t ])

But it looks odd, and I am wasting vectorization.
I tried to make it clear but I understand it might be kinda confusing, I am more than available to elaborate more. Please Help =D

You can use pyro.poutine.mask:

mask = torch.tensor([
    [x is not None for x in 1_output],

assert mask.shape == (N, T_max)
assert batch.shape == (N, T_max)
with pyro.plate("batch", N):
    for t in range(T_max):
        with mask(mask=masks[..., t]):
            out_t = pyro.sample(f"obs_x_{t}", Normal(loc_out, scale_out), obs=batch[:, t])

See the hidden Markov model examples for other examples of mask usage.

Thanks for your kind answer.
Sadly I am having troubles with poutine mask even with very simple examples.

> observations = torch.rand(10)
> mask_ones = torch.ones(10,dtype = torch.bool)
> mask_zeros = torch.zeros(10,dtype = torch.bool)
> with poutine.mask(mask = mask_ones):
>         out = pyro.sample("test",
>         dist.Normal(loc,scale),obs = obs)
>         print(out)
> with poutine.mask(mask = mask_zeros):
>         out = pyro.sample("test", dist.Normal(loc,scale),obs = obs)
>         print(out)

I would expect to see a difference in the two printed outputs, since one should be conditioned on the observation and one shouldn’t but I obtain the same results for both (out is equal to observations).
What am I missing?

mask has an effect on the log_prob that corresponds to the sample statement not on the random values themselves.

I don’t understand, could you elaborate further please? Which probability am I affecting by masking some observations?

Btw what I want to achieve is to develop a time series model which employs input_t and out_{t-1} to predict out_t . Out_t can either be observed (if the observation is available):
out_t = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out, scale_out), obs=batch[:, t ])

or not:

out_t = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out, scale_out))

This is because at each timestep i assign
loc_out = out_{t-1} + f(input_t)

where f is an arbitrary function.
Does the approach make sense?

the basic idea of mask in pyro boils down to the fact the vectorized operations are much faster than for loops. imagine i have a 10 observations only 5 of which are actually observed. then i could write something like

pyro.sample("obs0", ..., obs=data[0])
pyro.sample("obs2", ..., obs=data[2])
pyro.sample("obs3", ..., obs=data[3])
pyro.sample("obs7", ..., obs=data[7])
pyro.sample("obs8", ..., obs=data[8])

this would be annoying and slow. instead we can write something like

with pyro.plate("data", 10):
        pyro.sample("obs", ..., obs=data)

under the hood this will still compute 10 log probabilities for the 10 observations (to keep the computation vectorized and fast) but 5 of them will be ignored by whatever inference algorithm is used

1 Like

Sorry for the delay in the answer, it has been a rough week.
Thanks a lot, it’s way more clear now. I still have a question though.

Imagine I have to develop a model which predicts the total sales performed by a shop based on some daily input.
So calling s_0,s_1,…,s_t the sales performed in day 0, day 1, … day_t I have to predict:

out_0 = s_0
out_1 = s_0 + s_1

out_t = s_0 + s_1 + …y… + s_t

Sadly I don’t have all the observations (so as stated in a previous question for a time series I might have:

obs = [out_0,None,out_2,None,…,out_t]).

Now in my model I will probably state something like:

daily_sales = f(input) #where f is a learnable function 

loc_out_t = out_{t-1} + daily_sales 
scale_out_t = g(scale_out_{t-1},input) # where g is a learnable function


with pyro.plate("data", N_batch):
                  out_t = pyro.sample("obs_x_%d" % t,
                              dist.Normal(loc_out_t, scale_out_t),

So I do have a couple of questions:

  1. Is the approach correct?
  2. If I have not observed out_{t-1}, loc_out_t will still use the value provided by obs in t-1 as the value of out_{t-1} despite the fact that is masked (because as u explained me before it’s masked for the computation of the log probability).

So I will have for example

out_1 = None


loc_out_2 = None + daily_sales.

What I think it should use instead in case of missing observation would be:

out_{t-1} = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out_{t-1}, scale_out_{t-1}))

If you can help me figure out this would be incredbly useful, thanks a lot for your effort!

Hi @Mowgli, if I am reading your post correctly, future observations will depend on missing values at previous timesteps, so you will need to impute missing values rather than simply hiding them with poutine.mask.

One way to do this is to use the obs_mask keyword argument to pyro.sample, which takes a mask like the ones you have been passing to poutine.mask and uses it to create two sample sites, one observed and one latent, and merges them into a single array using the mask.

Your snippet above would be rewritten as something like

with pyro.plate("data", N_batch):
    out_t = pyro.sample("obs_x_%d" % t,
                        dist.Normal(loc_out_t, scale_out_t),
                        obs=obs[:,t], obs_mask=mask)
1 Like

Super! Thanks a lot! Just tested it on some fake data and looks like it does exactly what I need!
I will surely need some more help but I least I got a brick!
Thanks again, I am really lost alone out here =D

1 Like

Hi @eb8680_2, sorry for bringing up this again.
I tried developing a model where, during train phase, I pass a vector of observations and a obs_mask but I got dissapointing results.
What I noticed is that when I include the argument obs_mask for some reason the training becomes way more difficult even if the obs_mask is just full of True values.

    with pyro.plate("data",x.shape[0]):
        obs = pyro.sample("obs",dist.Normal(mean_output,sigma_output).to_event(1),obs = y)


    with pyro.plate("data",x.shape[0]):
        obs = pyro.sample("obs",dist.Normal(mean_output,sigma_output).to_event(1),obs = y,obs_mask=obs_mask)  

slows down training and gives worse results even with

obs_mask = torch.bernoulli(torch.ones(N_batch)) > 0 # bool tensor full of “True” values

Furthermore it looks like the loss is computed in a different way, I sometimes get strange loss values (loss oscillating between -5e21 to -5 e24 while before I had loss decreasing from ~+200 to -1300) and I have to pick lower learning rates to avoid divergence.

Any clue on why this happens?
Here’s my code, in case it might be useful to understand the problem

class BayesianRegression(PyroModule):
    def __init__(self,in_features,out_features,latent_features,hidden_features = 4):
        super(BayesianRegression, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.latent_features = latent_features
        self.hiddden_features = hidden_features

        self.linear_input_hidden = nn.Linear(in_features,hidden_features)
        self.linear_hidden_latent = nn.Linear(hidden_features,latent_features)

        self.linear_latent_hidden = nn.Linear(latent_features,hidden_features)
        self.linear_hidden_hidden2 = nn.Linear(hidden_features,hidden_features*2)
        self.linear_hidden2_hidden3 = nn.Linear(hidden_features*2,hidden_features)
        self.linear_hidden_output = nn.Linear(hidden_features,out_features)

        self.linear_input_sigma = nn.Linear(latent_features,out_features)

        self.relu = nn.ReLU()
        self.softplus = nn.Softplus()

    def forward(self,x,y = None,latent_obs = None , obs_mask = None):

        sigma_latent = pyro.sample("sigma_latent",dist.Uniform(0.,1.))
        mean_latent_ = self.relu(self.linear_input_hidden(x))
        mean_latent = self.linear_hidden_latent(mean_latent_)

        with pyro.plate("latent_data",x.shape[0]):
            latent = pyro.sample("latent_obs",dist.Normal(mean_latent,sigma_latent).to_event(1),obs = latent_obs)

        sigma_output = self.softplus(self.linear_input_sigma(latent))

        mean_output_ = self.relu(self.linear_latent_hidden(latent))
        mean_output_ = self.relu(self.linear_hidden_hidden2(mean_output_))
        mean_output_ = self.relu(self.linear_hidden2_hidden3(mean_output_))

        mean_output = self.linear_hidden_output(mean_output_)

        with pyro.plate("data",x.shape[0]):
            obs = pyro.sample("obs",dist.Normal(mean_output,sigma_output).to_event(1),obs = y) #,obs_mask=obs_mask)

Basically I have a NN which from input computes some latent variables (self.linear_input_hidden, self.linear_hidden_latent).
The latent variables are also observed ( latent = pyro.sample(…,obs = latent_obs).
The latent pass through another two NNs, one for output mean (self.linear_latent_hidden … self.linear_hidden_output) and one for output sigma (self.linear_input_sigma).

presumably your obs_mask is the wrong shape. what shape are mean_output, x, y, etc?

Thanks for answering again!

y.shape = [N_batch,output_dim] = [100,2]
x.shape = [N_batch,input_dim] = [100,4]
latent.shape = [N_batch,latent_dim] = [100,2]

mean_output.shape = [N_batch,output_dim] = [100,2]
sigma_output.shape = [N_batch,output_dim] = [100,2]

and finally:

obs_mask.shape = [N_batch] = 100

I tried to make obs_mask.shape = [N_batch,output_dim] but I got an error so I went back to obs_mask.shape = [N_batch].

Any clue?

Bonus question:
I have another question, I am a little confused on how pyro treats the output of the neural net (in my case mean_output and sigma_output).
What I understood is that during optimization the torch weights are optimized to obtain the posterior probability of mean_output and sigma_output. The posterior is computed by pyro from the prior (which in this case is the output of the NN ) and from the observations. Did I get it right?
If so I am passing a “random prior” since I have left the initialization of the weights of the neural net to pytorch.
At the beginning the weights of the nn are not initialized so they give to mean_output and sigma_output a value which is kinda random (depending on the random initialization of the weights by pytorch). Does those random values act as a prior probability? If so how do I ensure that these priors are plausible?

i’m sorry i don’t understand what you’re trying to do. can you explain? maybe start with something simpler before adding a bunch of neural networks.

I want to train a model which, from some input variables, computes the output variables exactly as the ruggedness-log gdp introduction tutorial.
So to compare to the ruggedness tutorial my input would be the ruggedness and my output the log gdp of the countries. The difference is I don’t want to assume a linear relationship but I rather learn the function in between (that’s why I’m using the NN).
In the tutorial:

mean = a + b_a * is_cont_africa + b_r * ruggedness + b_ar * is_cont_africa * ruggedness

while in my code:

mean = neural_net(input)

and then for both codes:

   with pyro.plate("data", len(ruggedness)):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=log_gdp)

The difference is that in the tutorial “a”, “b_a”,“b_r”,“b_ar” are defined from statements like:

a = pyro.sample(“a”, dist.Normal(0., 10.))

which tells pyro that a is a parameter to optimize and provides its prior.
In the tutorial pyro tries to find the posterior probability for all the parameters “a”, “b_a”,“b_r”,“b_ar” given the observations.
In my example the neural_net comes from pytorch so my question is: how does pyro optimize the weights of the neural net?
Then I have another question. In the tutorial the user somehow provides a reasonable prior to the parameters “a”, “b_a”,“b_r”,“b_ar”. I would like to avoid stating a reasonable prior for the weights of the neural net (I would have too many parameters) but at the same time I feel like I do have to provide some prior.
Right now when the weights of the NN are initialized (at random following pytorch default) the mean might assume nonsense values and so I might be passing nonsense prior to pyro:

   with pyro.plate("data", len(ruggedness)):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=log_gdp)

I hope my questions are understandable, if not I will try to elaborate better.
Thanks for all the help and the effort.

if you want do formulate bayesian neural networks in pyro we recommend you use tyxe.

as a beginner user you may want to avoid bayesian neural networks as they constitute an active research area and may be tricky to get to work correctly.

if you want to implement a simple bayesian neural network by hand i suggest you write it out explicitly (without using modules), maybe something like the following:

w1 = pyro.sample("w1", dist.Normal(torch.zeros(4, 5), torch.ones(4, 5)).to_event(2))
w2 = pyro.sample("w1", dist.Normal(torch.zeros(5, 2), torch.ones(5, 2)).to_event(2))
sigma = 0.1
mean = nonlinearity(X @ w1) @ w2
with pyro.plate("data", len(y)):
    pyro.sample("obs",dist.Normal(mean, sigma).to_event(1), obs=y) 

as a general rule, always start simple before adding additional complexity.

Thanks again.
Actually i want to avoid using a bayesian neural network as well (as you said I would like to avoid the complexity).
What I was asking is if there’s a clever way to provide a plausible prior.
Right now the prior is governed by a NN with weights initialized randomly:

mean = neural_net(input)
sigma = neural_net2(input)

Imagine the weights of the NN randomly initialized give an initial sigma really close to zero and the initial mean far from it’s true value. This wrong prior will be extremly hard to overcome.
Is there a way to avoid this?
The only thing that comes to my mind is initializing the weights of the NN to obtain the chosen mean and sigma but how?

i’m sorry but in so far as i understand what you’re trying to do it doesn’t seem like it’s a sensible approach. i don’t see any reason why a prior like that would ever make sense. what is the high-level modeling assumption about the data that you’d like to me? if you want to go beyond a linear model reasonable approaches might include a gaussian process or a two-layer bayesian neural network. alternatively you could use a linear model in an extended (fixed) feature space, including e.g. linear and quadratic interaction terms.