Bayesian Structural Time Series Question

Hi, I’m trying to construct a Bayesian Structural Time Series model with just a local linear trend and quarterly seasonality. I’m pretty sure I have the model specification set up correctly, but it gives really bad results when I train it with SVI on some simulated data.

I’m not sure what could be wrong here. Maybe AutoNormal isn’t a good guide for this kind of time series model, or is variational inference just not a good method for it?

The model specification is below. It might look a little weird, but I’m pretty sure the matrices are correct (e.g., the seasonal block is only 3x3 because the quarterly effects are constrained to sum to 0 for identifiability, and I verified it against Kevin Murphy’s book). But something must be wrong somewhere, because when I take guide traces from the posterior on in-sample data, the posterior quantiles are way off from the true parameter values (which I know, since the data is simulated).

import torch
import pyro
import pyro.distributions as dist


def model(data):
    time_dim = data.shape[0]
    data_init, data_mean, data_std = data[0], data.mean(), data.std()

    # - Priors.
    # State vector layout: [level, slope, seas_t, seas_t-1, seas_t-2].
    init_level = pyro.sample('init_level', dist.Normal(data_init, data_std).expand([1]).to_event(1))
    init_slope = pyro.sample('init_slope', dist.Normal(0., data_std).expand([1]).to_event(1))
    init_seas = pyro.sample('init_seas', dist.Normal(data_mean, data_std).expand([3]).to_event(1))
    z = torch.cat([init_level, init_slope, init_seas])

    level_scale = pyro.sample('level_scale', dist.HalfNormal(3.).expand([1]).to_event(1))
    slope_scale = pyro.sample('slope_scale', dist.HalfNormal(3.).expand([1]).to_event(1))
    seas_scale = pyro.sample('seas_scale', dist.HalfNormal(3.).expand([1]).to_event(1))
    # Near-zero noise on the two lagged seasonal states, which only carry memory.
    state_scale = torch.cat([level_scale, slope_scale, seas_scale, torch.ones(2) * 0.001])
    obs_scale = pyro.sample('obs_scale', dist.HalfNormal(3.))

    # - State dynamics.
    trend_state_block = torch.tensor([[1., 1.], [0., 1.]])
    seas_state_block = torch.cat([torch.full((1, 3), -1.), torch.eye(3)[:-1, :]])
    state_matrix = torch.block_diag(trend_state_block, seas_state_block)

    # - Observation dynamics.
    trend_obs_block = torch.tensor([[1., 0.]])
    seas_obs_block = torch.tensor([[1., 0., 0.]])  # float dtype, so it concatenates with the trend block
    obs_matrix = torch.cat([trend_obs_block, seas_obs_block], dim=1)

    # - Likelihood.
    state_list, obs_list = [], []
    for t in range(1, time_dim + 1):
        z = pyro.sample(f'z_{t}', dist.Normal(state_matrix @ z, state_scale).to_event(1))
        x = pyro.sample(f'x_{t}', dist.Normal(obs_matrix @ z, obs_scale).to_event(1),
                        obs=data[t - 1].reshape(1))
        state_list.append(z)
        obs_list.append(x)
    pyro.deterministic('state_list', torch.stack(state_list))
    pyro.deterministic('obs_list', torch.stack(obs_list))
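
For completeness, the fitting code is basically the standard Pyro SVI loop with an AutoNormal guide; the learning rate and number of steps below are just placeholders, and data is the simulated series as a 1-D tensor:

import pyro
import pyro.optim
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal

pyro.clear_param_store()
guide = AutoNormal(model)                       # mean-field guide over all latent sites
svi = SVI(model, guide, pyro.optim.Adam({'lr': 0.01}), loss=Trace_ELBO())

for step in range(2000):                        # placeholder number of steps
    loss = svi.step(data)
    if step % 200 == 0:
        print(f'step {step}: elbo loss = {loss:.1f}')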

Any insight is appreciated. Since I’m fitting through SVI and only pulling posteriors of the latents from in-sample periods, I don’t need to do anything with the Kalman filter, right?

To add a little context, since the specification might look a bit obscure: it follows the same structure as a hidden Markov model, in that latent vectors follow a Markov process (z_t depends only on z_t-1) and the scalar observation x_t is ‘emitted’ from z_t. In this BSTS model there are 2 latent trend variables (one for level, one for slope) and 3 latent seasonal variables (because the quarterly seasonal effects are constrained to sum to 0). All of this is pretty standard, as are the matrices in the model:

The transition matrix for trend is:
[[1, 1],
[0, 1]]

So, for example, this matrix times the column vector [level_t, slope_t], plus Gaussian noise, gives [level_t+1, slope_t+1]: the new level is the previous level plus the previous slope, and the slope is a random walk on the previous slope value.
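
As a quick sanity check of the trend block with arbitrary numbers:

import torch

trend_block = torch.tensor([[1., 1.], [0., 1.]])
z_trend = torch.tensor([10.0, 0.5])            # [level_t, slope_t], arbitrary values
print(trend_block @ z_trend)                   # tensor([10.5000, 0.5000]) = [level_t + slope_t, slope_t], before noise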

The transition matrix for seasonality is (this may look weird, but it’s the form two books recommend):
[[-1, -1, -1],
[1, 0, 0],
[0, 1, 0]]
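
And the same kind of check for the seasonal block, with arbitrary effect values:

import torch

seas_block = torch.tensor([[-1., -1., -1.], [1., 0., 0.], [0., 1., 0.]])
seas_state = torch.tensor([2.0, -1.0, 0.5])    # [s_t, s_t-1, s_t-2], arbitrary values
print(seas_block @ seas_state)                 # tensor([-1.5000, 2.0000, -1.0000]): new effect = -(2.0 - 1.0 + 0.5), older effects shift down

So the newest seasonal effect is minus the sum of the previous three (plus noise), which is what keeps any four consecutive quarterly effects summing to approximately zero.
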
When we concatenate trend and seasonality block-diagonally, we get the state transition matrix:
[[1, 1, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, -1, -1, -1],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]]

Then, since there are 5 latent variables, the observation x_t is computed from the state z_t by multiplying the state vector by this observation (row) vector:
[[1, 0, 1, 0, 0]]
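
Putting it all together, one noise-free step of the recursion with arbitrary state values looks like this:

import torch

state_matrix = torch.block_diag(torch.tensor([[1., 1.], [0., 1.]]),
                                torch.tensor([[-1., -1., -1.], [1., 0., 0.], [0., 1., 0.]]))
obs_matrix = torch.tensor([[1., 0., 1., 0., 0.]])

z = torch.tensor([10.0, 0.5, 2.0, -1.0, 0.5])  # [level, slope, seas_t, seas_t-1, seas_t-2], arbitrary
z_next = state_matrix @ z                      # tensor([10.5, 0.5, -1.5, 2.0, -1.0]), before state noise
x_next = obs_matrix @ z_next                   # tensor([9.]) = new level + new seasonal effect, before obs noise
print(z_next, x_next)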

All of the above is pretty standard, and the matrices follow exactly what several books prescribe. But when I fit the model and look at the results, they’re off, so I feel like I must be missing something about how to fit these kinds of models with SVI and a mean-field AutoNormal guide.

Specifically, when I look at the most probable values of the in-sample observations (x_1, x_2, …, x_n) based on ~100 guide traces from the fitted model, they’re way off from the actual data. I’m wondering if I should instead take guide traces only for the noise parameters, then use the Kalman filter to get the posterior of the local latent variables and, from that, the predictive distribution of the observations, rather than reading them directly off the guide traces?

Okay, I think I figured out the issue, so just posting this for posterity…

Basically, it seems that running a lot of guide traces on a model with sequential local latent variables {z1, z2, …, zn} doesn’t produce the true posterior for those variables. You actually do have to run a forward-backward (Kalman filter/smoother) pass to get the posteriors for {z1, …, zn}, and by extension the predictive distribution of {x1, …, xn}. I was getting confused because some of the Pyro time series models do get posteriors for local latent z’s by running guide traces, but in those models the z’s aren’t generated sequentially (so they’re a little closer to parameters than latent variables, though I know the terminology around that is a bit imprecise).

I was looking at the TensorFlow Probability implementation of this type of model, and they do use a mean-field (i.e., AutoNormal-style) variational posterior, so that wasn’t the issue. But when they make forecasts, they draw guide samples only for the parameters (i.e., the variances), and then apply the Kalman filter equations to get the exact posterior of the latent z’s (and the predictive x’s). So even though variational inference is used to get the approximate posterior for the variance parameters, Kalman filtering is used to get the exact posterior of the latent z’s (technically, since the posterior of the z’s is conditioned on the approximate variance parameters, it isn’t exact either, even though the forward-backward recursions themselves are exact).
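
For anyone finding this later, the workflow is roughly: draw samples of the global scale parameters from the fitted guide, then run the Kalman recursions conditioned on each draw. A minimal sketch, assuming the model and the AutoNormal guide object from the fitting code above:

# The local z_t's are deliberately ignored here; they come from the Kalman
# filter/smoother below, not from the guide.
scale_names = ['level_scale', 'slope_scale', 'seas_scale', 'obs_scale']
param_draws = []
for _ in range(100):
    sample = guide(data)                       # an autoguide call returns a dict {site_name: sampled value}
    param_draws.append({name: sample[name].detach() for name in scale_names})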

I assume I could (in theory) use sampling to get a variational posterior over the z’s, but it would mean drawing a set of guide traces for the initial z0, then for each of those a set of guide traces for z1, then for each of those a set for z2, and so on, which is a ton of sampling. Since the model uses parametric distributions and a closed-form solution is available, I’ll just implement the exact posterior formulas for the z’s and the predictive posterior for the x’s. I think I read that Pyro implements forward-backward for discrete latent (HMM) models, but I don’t think there’s a built-in equivalent for continuous latent models in the same way?
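
In case it helps anyone, this is roughly the forward (filtering) recursion I ended up writing; a minimal sketch for the time-invariant linear-Gaussian model above (the backward/smoother pass is omitted, the function name and the initial-state arguments m0/P0 are my own, and the matrices and scales are assumed to come from the model and the guide draws above):

import torch

def kalman_filter(data, state_matrix, obs_matrix, state_scale, obs_scale, m0, P0):
    # Forward pass: p(z_t | x_1..x_t) for each t.
    # data: (T,) observations; state_scale: (5,) state noise std devs;
    # obs_scale: scalar tensor of obs noise std dev; m0 (5,), P0 (5, 5): initial state prior.
    Q = torch.diag(state_scale ** 2)                        # state noise covariance
    R = (obs_scale ** 2).reshape(1, 1)                      # observation noise covariance
    m, P = m0, P0
    means, covs = [], []
    for x in data:
        # Predict: push the current estimate through the transition dynamics.
        m_pred = state_matrix @ m
        P_pred = state_matrix @ P @ state_matrix.T + Q
        # Update: correct the prediction with the new observation.
        S = obs_matrix @ P_pred @ obs_matrix.T + R          # innovation covariance
        K = P_pred @ obs_matrix.T @ torch.inverse(S)        # Kalman gain
        m = m_pred + K @ (x.reshape(1) - obs_matrix @ m_pred)
        P = P_pred - K @ obs_matrix @ P_pred
        means.append(m)
        covs.append(P)
    return means, covs

The filtered means/covariances give p(z_t | x_1..x_t); running a backward (RTS smoother) pass over them gives p(z_t | x_1..x_n), and pushing those through the observation equation gives the in-sample predictive distribution of the x_t’s.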