torch.Tensor.cumsum(dim) bug in DLM tutorial?

I’m getting some really weird behavior with the torch.Tensor.cumsum(dim) method when playing around with the DLM tutorial.

Basically, I’m pretty sure the pyro.sample(“drift”, …) statement inside a pyro.plate (technically, self.time_plate from the pyro.contrib.forecast module) creates a latent variable of shape [1095, 6] (1095 observations across 6 predictor time series). Since the sample is inside a plate, the 1095 rows are treated as i.i.d. local latent variables (in the tutorial, these are the latent drift variables).

Then, outside the plate, we take a simple cumulative sum down the rows to build a trend component for each of the 6 predictor time series. Since the tensor has only 2 dimensions, tensor.cumsum(0) should give exactly the same result as tensor.cumsum(-2), right? And yet, when I run and fit the model, the fit is far better with tensor.cumsum(-2) than with tensor.cumsum(0).
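
For a plain 2-D tensor the two calls really are identical (quick sanity check):

import torch

x = torch.randn(1095, 6)
assert torch.equal(x.cumsum(0), x.cumsum(-2))  # dim 0 and dim -2 are the same axis for a 2-D tensor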

I tried both arguments (dim=0 and dim=-2) in a simpler model and they gave near-identical fits, so I’m not sure what I’m missing in this model. I also tried dim=0, dim=1, dim=2, dim=-2, and dim=-1, and I can’t reproduce the fit from dim=-2 with any nonnegative value, even though there should be an equivalent nonnegative index (counting from the left) for dim=-2 (which counts from the right), correct?

The code for the model is below. zero_data has shape [1095, 1] (or slightly fewer rows, depending on the train-set boundaries) and covariates has shape [1095, 6] for 5 predictors plus 1 intercept.

import torch
import pyro
import pyro.distributions as dist
import pyro.poutine as poutine
from pyro.contrib.forecast import ForecastingModel
from pyro.infer.reparam import LocScaleReparam


class DLM(ForecastingModel):
    def model(self, zero_data, covariates):
        feature_dim = covariates.size(-1)

        # Global scale.
        drift_scale = pyro.sample("drift_scale", dist.LogNormal(-10, 10).expand([feature_dim]).to_event(1))
        # Local latent drift variable sampled as i.i.d.
        with self.time_plate:
            with poutine.reparam(config={"drift": LocScaleReparam()}):
                drift = pyro.sample("drift", dist.Normal(torch.zeros(covariates.size()), drift_scale).to_event(1))
        # Outside the plate.
        weights = drift.cumsum(dim=-2)  #<<ISSUE: dim=0 gives worse fit???>>
        pyro.deterministic("weights", weights)
        prediction = (weights * covariates).sum(-1, keepdim=True)
        pyro.deterministic("prediction", prediction)

        # Noise distribution.
        scale = pyro.sample("noise_scale", dist.LogNormal(-5, 10).expand([1]).to_event(1))
        noise_dist = dist.Normal(0, scale)
        self.predict(noise_dist, prediction)
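
For reference, the fit/predict step is the standard Forecaster setup from the tutorial, roughly like this (T1 here is just my train/test split index and the hyperparameters are only illustrative):

from pyro.contrib.forecast import Forecaster

forecaster = Forecaster(DLM(), data[:T1], covariates[:T1], learning_rate=0.1, num_steps=1000)
samples = forecaster(data[:T1], covariates, num_samples=100)  # prediction draws many samples at once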

OK, I did some digging through the source code of the pyro.contrib.forecast module and tried to implement the same model as the DLM tutorial using more general Pyro syntax (i.e., without the contrib module). I came up with the model below, which gives the exact same fit whether I use cumsum(dim=-2) or cumsum(dim=0), as expected. I pasted the model code below for reference. (Technically it’s not identical to the tutorial: the contrib module formulates things in terms of a residual distribution, while I wrote the more familiar likelihood format, which is the same thing expressed differently. Also, the last pyro.plate in the likelihood is not explicitly used in the contrib module, but the fit is the same with and without it.)

def dynamic_linear_model(data, covariates):
    # Remove all dimensions of size 1 from the time series data.
    data = data.squeeze()
    # Get number of predictor variables.
    features_dim = covariates.size(-1)

    # - Dynamic coefficients.
    drift_scale = pyro.sample("drift_scale", dist.LogNormal(-10, 10).expand([features_dim]).to_event(1))
    with pyro.plate("time", len(data), dim=-1):
        with poutine.reparam(config={"drift": LocScaleReparam()}):
            drift = pyro.sample("drift", dist.Normal(torch.zeros(covariates.size()), drift_scale).to_event(1))
    weights = drift.cumsum(dim=0)

    # - Linear model.
    prediction = (weights * covariates).sum(dim=-1)

    # - Track model components.
    pyro.deterministic("weights", weights)
    pyro.deterministic("prediction", prediction)

    # - Likelihood.
    # Prior for the observation scale.
    scale = pyro.sample("scale", dist.LogNormal(-5, 10))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(prediction, scale), obs=data)

Since I was able to get this working as intended using the more general Pyro syntax, I think there might be a very minor bug in the contrib module causing the earlier behavior with cumsum(dim=0). It’s possibly related to some of the complex reshaping logic in the predict() method of the ForecastingModel class. For my purposes, though, I’ll just continue with the model written in the more general syntax.

Hi @student_12,

tl;dr always try to use negative index values like .cumsum(-2).

In general it is safer in Pyro to use negative dimension indexing (indexing from the right), like .cumsum(-2), rather than nonnegative dimension indexing (from the left), like .cumsum(0). This matters because many inference algorithms use multi-sample batching by expanding tensors on the left; e.g., pyro.contrib.forecast.ForecastingModel draws multiple samples during prediction by expanding all tensors along a new leftmost sample dimension. Many other inference algorithms do this too, even the basic Trace_ELBO. Another use of expansion on the left is Pyro’s enumeration machinery. For more details, see the tensor shapes tutorial.
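
To make that concrete, here is a plain-PyTorch sketch (not the actual contrib.forecast internals) of what happens when a sample dimension is prepended on the left:

import torch

drift = torch.randn(1095, 6)                  # [time, features], as in the model above
batched = drift.expand(10, 1095, 6)           # a sample dim added on the left: [samples, time, features]

good = batched.cumsum(-2)                     # still a cumulative sum over the 1095 time steps
bad = batched.cumsum(0)                       # now a cumulative sum over the 10 samples, not over time

assert torch.equal(good[0], drift.cumsum(0))  # negative indexing survives the extra left dimension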

Ah ok, thanks for the info! I’ll keep that in mind!