Conditioning on available samples of minibatch

Mowgli · December 7, 2021, 3:00pm

Hi everyone,

I would like to develop a time series model where the output at each timesteps output_t depends on some input_t (input at that same timestep) and the output of previous timestep, output_{t-1}.

Both in training and in inference I will have some observations of output at different timesteps.

Let’s imagine a dataset of N timeseries of lenght T_max = 6 where I have observed the :

1_output = [None,None,obs_3,None,None,obs_6]
2_output = [obs_1,obs_2,None,None,None,obs_6]
.....
N_output = [None,obs_2,obs_3,None,None,None]

So for the first timeseries I have observed the output at time 3 and 6.
For the second timeserie I have observed the output at time 1,2 and 6.
…

And aggregating all the output tensors in a batch i.e batch.shape = [N,T_max]

My question is, how do I condition the sampling to only the observations I have?
Looking at the deep markov model tutorial I thought something like:

for t in range(batch.size(1)):
     [some code]
     out_t = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out, scale_out), obs=batch[:, t ])

But how can I specify to condition only on the observation I have?
I mean can I somehow mask the observations?
I guess a possible solution might be:

for t in range(batch.size(1)):
    for sample in range(batch.size(0)):
       out_t[sample] = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out[sample],scale_out[sample]),obs=batch[sample, t ])

But it looks odd, and I am wasting vectorization.
I tried to make it clear but I understand it might be kinda confusing, I am more than available to elaborate more. Please Help =D

eb8680_2 · December 10, 2021, 12:15am

You can use pyro.poutine.mask:

mask = torch.tensor([
    [x is not None for x in 1_output],
    ...
])

assert mask.shape == (N, T_max)
assert batch.shape == (N, T_max)
with pyro.plate("batch", N):
    for t in range(T_max):
        ...
        with mask(mask=masks[..., t]):
            out_t = pyro.sample(f"obs_x_{t}", Normal(loc_out, scale_out), obs=batch[:, t])

See the hidden Markov model examples for other examples of mask usage.

Mowgli · December 10, 2021, 1:51pm

Thanks for your kind answer.
Sadly I am having troubles with poutine mask even with very simple examples.

> observations = torch.rand(10)
> 
> mask_ones = torch.ones(10,dtype = torch.bool)
> mask_zeros = torch.zeros(10,dtype = torch.bool)
> 
> with poutine.mask(mask = mask_ones):
>         out = pyro.sample("test",
>         dist.Normal(loc,scale),obs = obs)
>         print(out)
> 
> with poutine.mask(mask = mask_zeros):
>         out = pyro.sample("test", dist.Normal(loc,scale),obs = obs)
>         print(out)

I would expect to see a difference in the two printed outputs, since one should be conditioned on the observation and one shouldn’t but I obtain the same results for both (out is equal to observations).
What am I missing?

martinjankowiak · December 10, 2021, 2:34pm

mask has an effect on the log_prob that corresponds to the sample statement not on the random values themselves.

Mowgli · December 10, 2021, 3:41pm

I don’t understand, could you elaborate further please? Which probability am I affecting by masking some observations?

Btw what I want to achieve is to develop a time series model which employs input_t and out_{t-1} to predict out_t . Out_t can either be observed (if the observation is available):
out_t = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out, scale_out), obs=batch[:, t ])

or not:

out_t = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out, scale_out))

This is because at each timestep i assign
loc_out = out_{t-1} + f(input_t)

where f is an arbitrary function.
Does the approach make sense?

martinjankowiak · December 10, 2021, 5:39pm

the basic idea of mask in pyro boils down to the fact the vectorized operations are much faster than for loops. imagine i have a 10 observations only 5 of which are actually observed. then i could write something like

pyro.sample("obs0", ..., obs=data[0])
pyro.sample("obs2", ..., obs=data[2])
pyro.sample("obs3", ..., obs=data[3])
pyro.sample("obs7", ..., obs=data[7])
pyro.sample("obs8", ..., obs=data[8])

this would be annoying and slow. instead we can write something like

with pyro.plate("data", 10):
    pyro.poutine.mask(mask=mask):
        pyro.sample("obs", ..., obs=data)

under the hood this will still compute 10 log probabilities for the 10 observations (to keep the computation vectorized and fast) but 5 of them will be ignored by whatever inference algorithm is used

Mowgli · December 14, 2021, 9:31am

Sorry for the delay in the answer, it has been a rough week.
Thanks a lot, it’s way more clear now. I still have a question though.

Imagine I have to develop a model which predicts the total sales performed by a shop based on some daily input.
So calling s_0,s_1,…,s_t the sales performed in day 0, day 1, … day_t I have to predict:

out_0 = s_0
out_1 = s_0 + s_1
…
out_t = s_0 + s_1 + …y… + s_t

Sadly I don’t have all the observations (so as stated in a previous question for a time series I might have:

obs = [out_0,None,out_2,None,…,out_t]).

Now in my model I will probably state something like:

daily_sales = f(input) #where f is a learnable function 

loc_out_t = out_{t-1} + daily_sales 
scale_out_t = g(scale_out_{t-1},input) # where g is a learnable function

finally

with pyro.plate("data", N_batch):
        pyro.poutine.mask(mask=mask):
                  out_t = pyro.sample("obs_x_%d" % t,
                              dist.Normal(loc_out_t, scale_out_t),
                              obs=obs[:,t])

So I do have a couple of questions:

Is the approach correct?
If I have not observed out_{t-1}, loc_out_t will still use the value provided by obs in t-1 as the value of out_{t-1} despite the fact that is masked (because as u explained me before it’s masked for the computation of the log probability).

So I will have for example

out_1 = None

and

loc_out_2 = None + daily_sales.

What I think it should use instead in case of missing observation would be:

out_{t-1} = pyro.sample("obs_x_%d" % t, dist.Normal(loc_out_{t-1}, scale_out_{t-1}))

If you can help me figure out this would be incredbly useful, thanks a lot for your effort!

eb8680_2 · December 14, 2021, 2:09pm

Hi @Mowgli, if I am reading your post correctly, future observations will depend on missing values at previous timesteps, so you will need to impute missing values rather than simply hiding them with poutine.mask.

One way to do this is to use the obs_mask keyword argument to pyro.sample, which takes a mask like the ones you have been passing to poutine.mask and uses it to create two sample sites, one observed and one latent, and merges them into a single array using the mask.

Your snippet above would be rewritten as something like

with pyro.plate("data", N_batch):
    out_t = pyro.sample("obs_x_%d" % t,
                        dist.Normal(loc_out_t, scale_out_t),
                        obs=obs[:,t], obs_mask=mask)

Mowgli · December 14, 2021, 2:19pm

Super! Thanks a lot! Just tested it on some fake data and looks like it does exactly what I need!
I will surely need some more help but I least I got a brick!
Thanks again, I am really lost alone out here =D

Mowgli · December 20, 2021, 8:58am

Hi @eb8680_2, sorry for bringing up this again.
I tried developing a model where, during train phase, I pass a vector of observations and a obs_mask but I got dissapointing results.
What I noticed is that when I include the argument obs_mask for some reason the training becomes way more difficult even if the obs_mask is just full of True values.
Changing

    with pyro.plate("data",x.shape[0]):
        obs = pyro.sample("obs",dist.Normal(mean_output,sigma_output).to_event(1),obs = y)

to

    with pyro.plate("data",x.shape[0]):
        obs = pyro.sample("obs",dist.Normal(mean_output,sigma_output).to_event(1),obs = y,obs_mask=obs_mask)

slows down training and gives worse results even with

obs_mask = torch.bernoulli(torch.ones(N_batch)) > 0 # bool tensor full of “True” values

Furthermore it looks like the loss is computed in a different way, I sometimes get strange loss values (loss oscillating between -5e21 to -5 e24 while before I had loss decreasing from ~+200 to -1300) and I have to pick lower learning rates to avoid divergence.

Any clue on why this happens?
Here’s my code, in case it might be useful to understand the problem

class BayesianRegression(PyroModule):
    def __init__(self,in_features,out_features,latent_features,hidden_features = 4):
        super(BayesianRegression, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.latent_features = latent_features
        self.hiddden_features = hidden_features

        self.linear_input_hidden = nn.Linear(in_features,hidden_features)
        self.linear_hidden_latent = nn.Linear(hidden_features,latent_features)

        self.linear_latent_hidden = nn.Linear(latent_features,hidden_features)
        self.linear_hidden_hidden2 = nn.Linear(hidden_features,hidden_features*2)
        self.linear_hidden2_hidden3 = nn.Linear(hidden_features*2,hidden_features)
        self.linear_hidden_output = nn.Linear(hidden_features,out_features)

        self.linear_input_sigma = nn.Linear(latent_features,out_features)


        self.relu = nn.ReLU()
        self.softplus = nn.Softplus()

    def forward(self,x,y = None,latent_obs = None , obs_mask = None):

        sigma_latent = pyro.sample("sigma_latent",dist.Uniform(0.,1.))
        mean_latent_ = self.relu(self.linear_input_hidden(x))
        mean_latent = self.linear_hidden_latent(mean_latent_)


        with pyro.plate("latent_data",x.shape[0]):
            latent = pyro.sample("latent_obs",dist.Normal(mean_latent,sigma_latent).to_event(1),obs = latent_obs)

        sigma_output = self.softplus(self.linear_input_sigma(latent))

        mean_output_ = self.relu(self.linear_latent_hidden(latent))
        mean_output_ = self.relu(self.linear_hidden_hidden2(mean_output_))
        mean_output_ = self.relu(self.linear_hidden2_hidden3(mean_output_))

        mean_output = self.linear_hidden_output(mean_output_)

        with pyro.plate("data",x.shape[0]):
            obs = pyro.sample("obs",dist.Normal(mean_output,sigma_output).to_event(1),obs = y) #,obs_mask=obs_mask)

Basically I have a NN which from input computes some latent variables (self.linear_input_hidden, self.linear_hidden_latent).
The latent variables are also observed ( latent = pyro.sample(…,obs = latent_obs).
The latent pass through another two NNs, one for output mean (self.linear_latent_hidden … self.linear_hidden_output) and one for output sigma (self.linear_input_sigma).

martinjankowiak · December 26, 2021, 6:39pm

presumably your obs_mask is the wrong shape. what shape are mean_output, x, y, etc?

Mowgli · December 27, 2021, 9:27am

Thanks for answering again!

y.shape = [N_batch,output_dim] = [100,2]
x.shape = [N_batch,input_dim] = [100,4]
latent.shape = [N_batch,latent_dim] = [100,2]

mean_output.shape = [N_batch,output_dim] = [100,2]
sigma_output.shape = [N_batch,output_dim] = [100,2]

and finally:

obs_mask.shape = [N_batch] = 100

I tried to make obs_mask.shape = [N_batch,output_dim] but I got an error so I went back to obs_mask.shape = [N_batch].

Any clue?

Bonus question:
I have another question, I am a little confused on how pyro treats the output of the neural net (in my case mean_output and sigma_output).
What I understood is that during optimization the torch weights are optimized to obtain the posterior probability of mean_output and sigma_output. The posterior is computed by pyro from the prior (which in this case is the output of the NN ) and from the observations. Did I get it right?
If so I am passing a “random prior” since I have left the initialization of the weights of the neural net to pytorch.
At the beginning the weights of the nn are not initialized so they give to mean_output and sigma_output a value which is kinda random (depending on the random initialization of the weights by pytorch). Does those random values act as a prior probability? If so how do I ensure that these priors are plausible?

martinjankowiak · December 28, 2021, 6:34pm

i’m sorry i don’t understand what you’re trying to do. can you explain? maybe start with something simpler before adding a bunch of neural networks.

Mowgli · December 29, 2021, 8:55am

Sure!
I want to train a model which, from some input variables, computes the output variables exactly as the ruggedness-log gdp introduction tutorial.
So to compare to the ruggedness tutorial my input would be the ruggedness and my output the log gdp of the countries. The difference is I don’t want to assume a linear relationship but I rather learn the function in between (that’s why I’m using the NN).
In the tutorial:

mean = a + b_a * is_cont_africa + b_r * ruggedness + b_ar * is_cont_africa * ruggedness

while in my code:

mean = neural_net(input)

and then for both codes:

   with pyro.plate("data", len(ruggedness)):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=log_gdp)

The difference is that in the tutorial “a”, “b_a”,“b_r”,“b_ar” are defined from statements like:

a = pyro.sample(“a”, dist.Normal(0., 10.))

which tells pyro that a is a parameter to optimize and provides its prior.
In the tutorial pyro tries to find the posterior probability for all the parameters “a”, “b_a”,“b_r”,“b_ar” given the observations.
In my example the neural_net comes from pytorch so my question is: how does pyro optimize the weights of the neural net?
Then I have another question. In the tutorial the user somehow provides a reasonable prior to the parameters “a”, “b_a”,“b_r”,“b_ar”. I would like to avoid stating a reasonable prior for the weights of the neural net (I would have too many parameters) but at the same time I feel like I do have to provide some prior.
Right now when the weights of the NN are initialized (at random following pytorch default) the mean might assume nonsense values and so I might be passing nonsense prior to pyro:

   with pyro.plate("data", len(ruggedness)):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=log_gdp)

I hope my questions are understandable, if not I will try to elaborate better.
Thanks for all the help and the effort.

martinjankowiak · December 30, 2021, 6:34pm

if you want do formulate bayesian neural networks in pyro we recommend you use tyxe.

as a beginner user you may want to avoid bayesian neural networks as they constitute an active research area and may be tricky to get to work correctly.

if you want to implement a simple bayesian neural network by hand i suggest you write it out explicitly (without using modules), maybe something like the following:

w1 = pyro.sample("w1", dist.Normal(torch.zeros(4, 5), torch.ones(4, 5)).to_event(2))
w2 = pyro.sample("w1", dist.Normal(torch.zeros(5, 2), torch.ones(5, 2)).to_event(2))
sigma = 0.1
mean = nonlinearity(X @ w1) @ w2
with pyro.plate("data", len(y)):
    pyro.sample("obs",dist.Normal(mean, sigma).to_event(1), obs=y)

as a general rule, always start simple before adding additional complexity.

Mowgli · December 30, 2021, 7:30pm

Thanks again.
Actually i want to avoid using a bayesian neural network as well (as you said I would like to avoid the complexity).
What I was asking is if there’s a clever way to provide a plausible prior.
Right now the prior is governed by a NN with weights initialized randomly:

mean = neural_net(input)
sigma = neural_net2(input)

Imagine the weights of the NN randomly initialized give an initial sigma really close to zero and the initial mean far from it’s true value. This wrong prior will be extremly hard to overcome.
Is there a way to avoid this?
The only thing that comes to my mind is initializing the weights of the NN to obtain the chosen mean and sigma but how?

martinjankowiak · December 31, 2021, 12:43am

i’m sorry but in so far as i understand what you’re trying to do it doesn’t seem like it’s a sensible approach. i don’t see any reason why a prior like that would ever make sense. what is the high-level modeling assumption about the data that you’d like to me? if you want to go beyond a linear model reasonable approaches might include a gaussian process or a two-layer bayesian neural network. alternatively you could use a linear model in an extended (fixed) feature space, including e.g. linear and quadratic interaction terms.