Scan smooths, doesn't forward filter? (state-space, forecasting, dynamic linear models)

This topic might have been better added on to Forecasting with a stochastic volatility model in numpyro.

I have been wondering for some time whether the scan function, in a time series model with time-varying weights (a DLM, etc.), actually provides posterior samples from the perspective of smoothing rather than forward filtering. I have attached a very simple example in a Google Colab notebook (please let me know if anything is not working, if anything is unclear, or if I've made an error; it was somewhat hastily done). Rather than repeat myself with less clarity, I think the notebook will give a better idea of my concern. Of course, from the perspective of smoothing this is all OK, but I was hoping to know for certain what is going on. The backwards-smoothing perspective in relation to the scan function seems to coincide with the paper quoted in the examples (Simo Särkkä, Ángel F. García-Fernández), at least according to my brief skim.
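For concreteness, here is a minimal sketch of the kind of model I mean (a toy local-level DLM; the names and priors are illustrative, not the exact notebook code). The question is whether HMC on this model gives p(theta_t | y_{1:t}) at each t, or p(theta_t | y_{1:T}):

```python
import numpyro
import numpyro.distributions as dist
from numpyro.contrib.control_flow import scan

def local_level_dlm(y=None, T=100):
    # static parameters
    sigma_obs = numpyro.sample("sigma_obs", dist.HalfNormal(1.0))
    sigma_state = numpyro.sample("sigma_state", dist.HalfNormal(1.0))
    theta0 = numpyro.sample("theta0", dist.Normal(0.0, 1.0))

    def transition(theta_prev, y_t):
        # time-varying level: theta_t | theta_{t-1} ~ N(theta_{t-1}, sigma_state)
        theta_t = numpyro.sample("theta", dist.Normal(theta_prev, sigma_state))
        numpyro.sample("y", dist.Normal(theta_t, sigma_obs), obs=y_t)
        return theta_t, theta_t

    # when y is given, T must equal len(y)
    scan(transition, theta0, y, length=T)
```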

I did re-run the hmm example (Example: Hidden Markov Model — NumPyro documentation) and noticed commentary about the forward filter with scan; however, I did not draw a detailed analogy to the time series setting, nor did I experiment to see if there was any “look-ahead” bias in the parameters. I also briefly tried to get the hmm_enum examples running, but after a bit of tinkering in my local PyCharm I gave up (for now).

Here is the link:
link

In the event that scan is conducting backwards smoothing (using information all the way up to the end point, time T), does anyone know of a way to manipulate scan to get forward filtering, using information only up to time t (for all t)? I suppose I could run scan up to t_2, then t_3, and so on, take out the posteriors, evolve them, and make a forecast… but this would be quite a bit more computation than I'd hope to do…

I very much hope I am using scan in the wrong way and that forward filtering posteriors are attainable! Any help at all would be greatly appreciated!
Thank you

Just a further point of clarification / question:

  • I noticed that the paper quoted in the scan source code (Särkkä and García-Fernández) mentions extending the linear/Gaussian framework to nonlinear/non-Gaussian models. Given the use of NUTS within scan (the forecasting example in the docs), I presume this extension has been found and integrated into numpyro/scan?

  • Is there further documentation, or are there references, to understand how HMC integrates with scan? Or are nonlinearities (for example, in the state equation) actually not modelled correctly when the model is run? I ask since the function documentation says “Not all transition functions f are supported”, and given that the paper mentioned had not yet solved the nonlinear/non-Gaussian case, I wondered whether this caveat covers such nonlinear/non-Gaussian cases, or whether the comment regarding f is more about other cases, such as a nested scan (which is unsupported).

  • I went ahead and added a section to the Google Colab notebook mentioned above (link here again) that explores a nonlinear state equation (see the sketch after this list for the general shape of such a model). It seems to be modelling the nonlinear state equation correctly, endorsing the idea that HMC has been correctly integrated with the scan functionality… but I'd love to see more detail/references on how this is accomplished. I didn't go all the way and compare the nonlinear results with those of an unscented Kalman filter, for example, but I presume the results would be comparable…

  • Last, if anyone has any updates on using scan for forward filtering coefficients (i.e. without look-ahead bias), I would be super appreciative!
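For reference, the nonlinear variant mentioned in the third bullet looks like this in scan form (a hedged sketch with an illustrative transition, not the notebook's exact model). HMC only needs the joint log density, so any JAX-traceable transition works:

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.contrib.control_flow import scan

def nonlinear_ssm(y=None, T=100):
    sigma_obs = numpyro.sample("sigma_obs", dist.HalfNormal(1.0))
    sigma_state = numpyro.sample("sigma_state", dist.HalfNormal(1.0))
    theta0 = numpyro.sample("theta0", dist.Normal(0.0, 1.0))

    def transition(theta_prev, y_t):
        # nonlinear state equation: theta_t ~ N(sin(theta_{t-1}), sigma_state)
        theta_t = numpyro.sample("theta", dist.Normal(jnp.sin(theta_prev), sigma_state))
        numpyro.sample("y", dist.Normal(theta_t, sigma_obs), obs=y_t)
        return theta_t, theta_t

    scan(transition, theta0, y, length=T)
```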

OK, I have further checked the model against a simple for loop over the time series.

  • It behaves like scan

  • Which seems like a problem! Both are displaying look-ahead bias in the posteriors… I would love to be wrong, but this doesn't look good…

  • …or, as has been done elsewhere with scan, the model is simply trained only up to a point 0:T and then forecasted from T:end, in which case the posterior parameters would have no opportunity to learn from data past T. This is fine too, but then the only way to get look-ahead-free posteriors (within scan or a for loop, it seems) is to rerun the model on 0:t, then again on 0:t+1, and so on, taking the last posterior at each time point (see the sketch below)… is there any way to avoid this heavy rolling-window-esque computation in order to obtain posteriors without look-ahead bias (and hence forecasts without look-ahead bias)?
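To make the rolling idea in the last bullet concrete, a hedged sketch (the model name is a placeholder, e.g. the local_level_dlm sketch above): refit on an expanding window y[:t] at each step, so the step-t posterior conditions only on data up to t:

```python
from jax import random
from numpyro.infer import MCMC, NUTS

def rolling_posteriors(model, y, min_window=10):
    # refit from scratch on each expanding window y[:t];
    # the step-t posterior then has no look-ahead past t
    rng_key = random.PRNGKey(0)
    posteriors = []
    for t in range(min_window, len(y) + 1):
        rng_key, run_key = random.split(rng_key)
        mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=500, progress_bar=False)
        mcmc.run(run_key, y=y[:t], T=t)
        posteriors.append(mcmc.get_samples())  # look-ahead-free posterior at time t
    return posteriors
```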

The for-loop section has also been added to this Google Colab notebook:

Hi @djchester, in those examples the posterior is conditioned on the full data. To get the forward-filter distribution, you might want to use TFP's LinearGaussianSSM.
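For anyone landing here, a minimal sketch of what this points at (a toy local-level model; parameter values are illustrative). TFP's LinearGaussianStateSpaceModel.forward_filter returns filtered quantities whose step-t values condition only on y[0:t]:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# local-level model: x_t = x_{t-1} + w_t,  y_t = x_t + v_t
ssm = tfd.LinearGaussianStateSpaceModel(
    num_timesteps=100,
    transition_matrix=tf.linalg.LinearOperatorFullMatrix([[1.0]]),
    transition_noise=tfd.MultivariateNormalDiag(scale_diag=[0.1]),
    observation_matrix=tf.linalg.LinearOperatorFullMatrix([[1.0]]),
    observation_noise=tfd.MultivariateNormalDiag(scale_diag=[0.5]),
    initial_state_prior=tfd.MultivariateNormalDiag(scale_diag=[1.0]),
)

y = ssm.sample()  # shape (100, 1)
# forward_filter returns per-step filtered (not smoothed) moments
(log_liks, filt_means, filt_covs,
 pred_means, pred_covs, obs_means, obs_covs) = ssm.forward_filter(y)
```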

hi @fehiepsi! Thanks for your response.

I may have simplified my model a bit for demonstrative purposes; however, the models I actually want to build (and have built) are nonlinear Gaussian state space models (pursuing full HMC as opposed to a sequential Monte Carlo particle filter). Presumably, forward-filtered posteriors (and hence forecasts) would be impossible within an HMC-based for-loop or scan model in numpyro? [I believe this is what you are saying.]

Then I can confirm that, from a smoothing perspective, numpyro's scan + HMC works well for state learning in a framework that is nonlinear in the state equation, on par with or better than an unscented Kalman filter, for example. However, the paper mentions that nonlinear/non-Gaussian extensions to the scan functionality were left for future work… so I am very curious how numpyro extended the smoother, as scan + HMC (and hence nonlinear frameworks) clearly works! [There is an example in the Colab notebook as well, and one can see a clear difference between the linear and nonlinear models.]

Finally, at one point I implemented a non-Gaussian state space model both with scan and without scan, but found the time each took was the same. Does scan recognize the non-Gaussian distributions in the sampler and default to a loop? When using Gaussian state space models, scan is of course significantly faster.

We do forward-sampling forecasts using Predictive, as in the Time Series Forecasting — NumPyro documentation example. To get the posterior, we simply run HMC on the potential function -log p(obs, latent). We do not use any fancy algorithms in scan.
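For example (a sketch, assuming the local_level_dlm toy model from earlier and a training array y_train; forecasting here hand-rolls the forward sampling from posterior draws of the final state, rather than replaying the model through Predictive):

```python
import jax.numpy as jnp
from jax import random
from numpyro.infer import MCMC, NUTS

# fit on the training window only
mcmc = MCMC(NUTS(local_level_dlm), num_warmup=1000, num_samples=1000)
mcmc.run(random.PRNGKey(0), y=y_train, T=len(y_train))
samples = mcmc.get_samples()

# forward-sample the random walk from each posterior draw of theta_T
key = random.PRNGKey(1)
theta = samples["theta"][:, -1]            # (num_draws,) draws of the final state
sigma_state, sigma_obs = samples["sigma_state"], samples["sigma_obs"]
forecasts = []
for _ in range(10):                        # 10-step-ahead forecast
    key, k1, k2 = random.split(key, 3)
    theta = theta + sigma_state * random.normal(k1, theta.shape)
    forecasts.append(theta + sigma_obs * random.normal(k2, theta.shape))
forecast = jnp.stack(forecasts)            # (10, num_draws) predictive draws
```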


Thank you @fehiepsi! I see: it is a train-test format as opposed to one-step-ahead ‘online’ style forecasting. I suppose the only way to do the latter would be a rolling-window style retraining of the model at each time step.

It seems, then, that scan works very broadly with any model specification using HMC, such as nonlinear formulations (but perhaps not all non-Gaussian distributions).

  • I think the origin of my confusion is the source documentation of the scan primitive (numpyro.contrib.control_flow.scan — NumPyro documentation), particularly the definition of scan(), where the reference paper is Temporal Parallelization of Bayesian Smoothers by Simo Särkkä and Ángel F. García-Fernández. That paper, at the risk of repeating myself, is about smoothing linear Gaussian state space models, not nonlinear formulations…

  • But this then implies there has been more work on making scan() usable in general settings beyond what the reference paper discusses. I also checked jax._src.lax.control_flow.loops — JAX documentation, but could not find any reference papers.

  • Is there more material (papers?) to really understand how scan() works with general HMC-style models? Currently it seems a bit mysterious…

Thanks again!

scan is just a fancy for loop. it does not itself do inference.

in numpyro:

  • scan is used in model definitions to define autoregressive priors
  • inference is done with inference algorithms like HMC or SVI
  • Predictive is used to take samples from an inference algorithm (it does not itself do inference) and pipe them back through a model to get various predictive quantities
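a minimal sketch of that three-stage workflow against the toy model above (a hedged sketch; names are illustrative):

```python
from jax import random
from numpyro.infer import MCMC, NUTS, Predictive

# 1. the model (with its scan-defined autoregressive prior) is local_level_dlm above
# 2. inference with an inference algorithm, here HMC/NUTS
mcmc = MCMC(NUTS(local_level_dlm), num_warmup=1000, num_samples=1000)
mcmc.run(random.PRNGKey(0), y=y_obs, T=len(y_obs))

# 3. pipe posterior samples back through the model for predictive quantities
predictive = Predictive(local_level_dlm, mcmc.get_samples())
posterior_predictive = predictive(random.PRNGKey(1), y=None, T=len(y_obs))["y"]
```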

I see. It just strikes me as interesting to understand the scan aspect in a little more detail with respect to the parallelization that seems to be occurring, and how this parallelization interacts with HMC mixing (or, at least, this is how scan generally seems to be presented). The scan models certainly mix at speeds that imply parallelization…

  • from the scan() definition: “…The joint density is evaluated using parallel-scan (reference [Sarkka, Garcia-Fernandez]) over time dimension, which reduces parallel complexity to O(log(length))”

  • …of course this was for the specific linear Gaussian case in the paper… so I would be curious how scan() uses parallel-scan over the time dimension for nonlinear models (which require HMC). References/papers for what's under the hood of scan() in numpyro would be super appreciated!

Then, (posterior) predictives are great, but of course, as you know, they are not quite suitable for forecasts… no matter, as I think I now see that rolling windows are probably the only way forward (and I'll just have to eat the computational cost :rofl:).

EDIT: A question, following from the above. It seems confirmed that all parameters in a scan (or loop) model are conditioned on the full data up to the last time point. Then I presume that using Predictive within the scan-style model (or loop) would sample out posteriors at, say, t = 5, but unfortunately those posteriors would have conditioned on data up to t = end, rendering the posterior predictive at t = 5 a distribution that has actually conditioned on all data up to t = end. The reason I bring this up is that if the posterior predictive were somehow conditioned only on data up to t = 5, then one could evolve it as a true forecast density for t = 6 without look-ahead bias. This would alleviate the need to re-run the HMC model at every time point (rolling window), but alas, this seems out of reach at the moment…

parallel-scan is only used for hmm-like discrete latent variable models.

generic inference algorithms like HMC do not produce any intermediate filtering-like results. for that you need SMC or similar


Perfect, thank you!

But just to completely confirm (apologies for the repetition): scan is really not using parallelization for the HMC models (those that do not involve discrete latent variables)? Scan is really just an improved for-loop? I find that fairly impressive given the speed-ups!

in e.g. numpyro/examples/hmm_enum.py at master · pyro-ppl/numpyro · GitHub, if priors are included, a log(T) parallel scan is used to marginalize out the discrete latent variables and HMC does inference on the (remaining) continuous latents


I think this makes sense. Regarding scan for the continuous variables, the documentation says:

  • “under the hood, scan stacks all the priors’ parameters and values into
    an additional time dimension. This allows us computing the joint
    density in parallel.”

  • That is, if we have \theta_{1,t} = \theta_{1,t-1} + \varepsilon_{1,t}, \varepsilon_{1,t} \sim N(\cdot,\cdot), then scan would lump (\theta_{1,t=1},\ldots,\theta_{1,t=T}) into a single multivariate prior, and similarly, if \theta_{2,t} = \theta_{2,t-1} + \varepsilon_{2,t}, \varepsilon_{2,t} \sim N(\cdot,\cdot), it would lump (\theta_{2,t=1},\ldots,\theta_{2,t=T}) into another multivariate prior, and then perform HMC with (y_{t=1},\ldots,y_{t=T}), where y_t = y_t(\theta_{1,t},\theta_{2,t}). This is in opposition to running HMC at each time point, i.e. “vectorizing > looping”. Reasonable, since the joint factorizes: p(\theta_{1,1},\ldots,\theta_{1,T}) = p(\theta_{1,1})\prod_{t=2}^{T}p(\theta_{1,t} \mid \theta_{1,t-1})… (and, not a profound statement, but this works for any nonlinear formulation)… this would definitely yield faster mixing…
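To illustrate the stacking point with a concrete (hedged) example, not numpyro internals: a random-walk prior written sequentially has the same joint density as one vectorized site over the time axis, which is essentially what makes “vectorizing > looping” possible:

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def random_walk_prior(T, sigma):
    # theta_t = theta_{t-1} + eps_t with theta_0 = 0 is equivalent to
    # sampling all T innovations at once and taking a cumulative sum,
    # so the joint density is evaluated in one vectorized pass
    eps = numpyro.sample("eps", dist.Normal(0.0, sigma).expand([T]).to_event(1))
    return jnp.cumsum(eps)
```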

actually i think i may have accidentally misled you.

scan also handles linear gaussian models using funsor; however it’s still the case that the computation that is being done during inference is marginalization (so in effect smoothing).

@ordabayev can provide a more detailed answer than i