Error when running importance.psis_diagnostic() function

Hello!

I’m trying to use the importance.psis_diagnostic() function to evaluate the fit of my Bayesian linear regression model:

def linear_regression_model(x: torch.Tensor, y: torch.Tensor = None) -> torch.Tensor:
    """

    Parameters
    ----------
    x
        The input feature matrix of shape: (num_examples, num_features).
    y
        The response/target variable vector of size: num_examples.

    Returns
    -------
    mean
        The regression line.
    """
    weight = pyro.sample(
        'weight', dist.Normal(weight_loc, weight_scale).expand([x.shape[1], 1]).to_event(2))
    
    bias = pyro.sample('bias', dist.Normal(bias_loc, bias_scale).expand([1]).to_event(1))

    sigma = pyro.sample('sigma', dist.HalfNormal(scale=sigma_scale))

    mean = torch.matmul(x, weight).squeeze(dim=-1) + bias

    with pyro.plate('data', x.shape[0]):
        obs = pyro.sample('obs', dist.Normal(mean, sigma), obs=y)

    return mean

The model trains successfully and I’m able to use the Predictive class with parallel=True. However, when I run the psis_diagnostic function I get the following error:

k_hat = importance.psis_diagnostic(model, guide, x_train_torch, y_train_torch, num_particles=100)

ValueError: Error while packing tensors at site 'obs':
  Invalid tensor shape.
  Allowed dims: -8, -1
  Actual shape: (100, 100, 1, 1, 1, 1, 1, 1, 5000)
  Try adding shape assertions for your model's sample values and distribution parameters.
Trace Shapes:                                 
 Param Sites:                                 
Sample Sites:                                 
  weight dist     100 1 1 1 1 1 1     1 | 60 1
        value     100 1 1 1 1 1 1     1 | 60 1
     log_prob     100 1 1 1 1 1 1     1 |     
    bias dist     100 1 1 1 1 1 1     1 |  1  
        value     100 1 1 1 1 1 1     1 |  1  
     log_prob     100 1 1 1 1 1 1     1 |     
   sigma dist     100 1 1 1 1 1 1     1 |     
        value     100 1 1 1 1 1 1     1 |     
     log_prob     100 1 1 1 1 1 1     1 |     
     obs dist 100 100 1 1 1 1 1 1 5000 |     
        value                     5000 |     
     log_prob 100 100 1 1 1 1 1 1 5000 |

In this example, I have 5000 observations and 60 features. Based on the error message, I’m not sure how to modify the 'obs' site to make it compatible with the psis_diagnostic() function. Any pointers would be appreciated!

if you want to use psis_diagnostic you need to make sure that your model code is fully vectorized because under the hood psis_diagnostic encloses the model in a plate. you probably need to remove .squeeze(dim=-1) (and possibly make other changes as well)
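
roughly speaking, psis_diagnostic does something like the following under the hood (an illustrative sketch only, not the actual pyro source; the plate name and dim here are placeholders):

import pyro

def _vectorize(fn, num_particles, max_plate_nesting):
    # wrap the model (and guide) in one extra plate over particles, so every
    # sample site gains a batch dimension of size num_particles on the left
    def wrapped(*args, **kwargs):
        with pyro.plate("particles", num_particles, dim=-max_plate_nesting - 1):
            return fn(*args, **kwargs)
    return wrapped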

Hello @martinjankowiak! Thank you so much, I really appreciate your response. I tried your suggestion to remove .squeeze(dim=-1) and now get the following error:

ValueError: at site "obs", invalid log_prob shape
  Expected [5000], actual [5000, 5000]
  Try one of the following fixes:
  - enclose the batched tensor in a with pyro.plate(...): context
  - .to_event(...) the distribution being sampled
  - .permute() data dimensions

I’m sorry, I thought that my code was fully vectorized because I needed to do that to make Predictive(parallel=True) work. I’ve read through the tutorial on tensor shapes but I can’t seem to figure out how else to modify my model. I’d greatly appreciate any other recommendations you have!

by “remove” i mean “remove and adapt appropriately”. you need to deal with the fact that every sample statement will now generate samples with an additional dimension on the left. it might be easier if you use pyro plates everywhere and avoid expand and the like
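
for instance, a toy snippet like this (names here are just for illustration) shows the extra leftmost dimension every sample site picks up under an outer plate:

import pyro
import pyro.distributions as dist

def toy_site():
    # event shape (3, 1), analogous to the weight site above
    w = pyro.sample("w", dist.Normal(0., 1.).expand([3, 1]).to_event(2))
    print(w.shape)

toy_site()                                 # torch.Size([3, 1])
with pyro.plate("particles", 10, dim=-1):
    toy_site()                             # torch.Size([10, 3, 1])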

Thank you! I’ve been trying to wrap the weight and bias sample statements with plates:

    with pyro.plate('weight_plate', [x.shape[1], 1], dim=-2):
        weight = pyro.sample(
            'weight', dist.Normal(weight_loc, weight_scale))

to replace:

weight = pyro.sample(
        'weight', dist.Normal(weight_loc, weight_scale).expand([x.shape[1], 1]).to_event(2))

but I can’t seem to get the plate nesting correct, and I’m struggling with how to modify the mean calculation to make everything work:
mean = torch.matmul(x, weight).squeeze(dim=-1) + bias

Would it be at all possible to get a simple example of how you would structure this linear regression using plates where the inputs are a 2D weight tensor and 1D observation tensor? I’ve tried my best to modify the original Bayesian Regression tutorial in the docs but I feel like I’m totally stuck. Thanks again!

Hi @martinjankowiak. Thank you again for your help! I’ve tried implementing your recommendations by removing the expand and wrapping all the sample statements with plates. I think I’m close but I seem to be adding an extra batch dimension to the obs site. Do you have any recommendations? I can’t seem to figure out if I’m nesting the plates properly for the linear regression.

def linear_regression_model(x: torch.Tensor, y: torch.Tensor = None) -> torch.Tensor:
    """
    Parameters
    ----------
    x
        The input feature matrix of shape: (num_examples, num_features).
    y
        The response/target variable vector of size: num_examples.
   
    Returns
    -------
    mean
        The regression line.
    """
 
    with pyro.plate("weight_plate", x.shape[1], dim=-2):
        weight = pyro.sample('weight', dist.Normal(weight_loc, weight_scale))
    
    with pyro.plate("bias_plate", 1, dim=-1):
        bias = pyro.sample('bias', dist.Normal(bias_loc, bias_scale))

    with pyro.plate('sigma_plate', 1, dim=-1):
        sigma = pyro.sample('sigma', dist.HalfNormal(scale=sigma_scale))

    mean = torch.matmul(x, weight).squeeze(dim=-1) + bias

    with pyro.plate('data', x.shape[0], dim=-1):
        obs = pyro.sample('obs', dist.Normal(mean, sigma), obs=y)

    return mean

The above returns:

ValueError: Error while packing tensors at site 'obs':
  Invalid tensor shape.
  Allowed dims: -8, -1
  Actual shape: (100, 100, 1, 1, 1, 1, 1, 5000)
  Try adding shape assertions for your model's sample values and distribution parameters.
Trace Shapes:                           
 Param Sites:                           
Sample Sites:                           
  weight dist 100   1 1 1 1 1 60     1 |
        value 100   1 1 1 1 1 60     1 |
     log_prob 100   1 1 1 1 1 60     1 |
    bias dist 100   1 1 1 1 1  1     1 |
        value 100   1 1 1 1 1  1     1 |
     log_prob 100   1 1 1 1 1  1     1 |
   sigma dist 100   1 1 1 1 1  1     1 |
        value 100   1 1 1 1 1  1     1 |
     log_prob 100   1 1 1 1 1  1     1 |
     obs dist 100 100 1 1 1 1  1  5000 |
        value                     5000 |
     log_prob 100 100 1 1 1 1  1  5000 |

can you please provide a runnable code snippet with import statements etc?

import torch
import pyro
import pyro.distributions as dist
import numpy as np
import pandas as pd
from torch import Tensor

from dataclasses import dataclass
from typing import List, Dict, Optional, Union, Any

from pyro.infer.autoguide import init_to_mean

pyro.enable_validation(True)


@dataclass
class PriorParams:
    """Class for specifying the parameters of the prior distributions used in the linear regression model."""
    # Regression coefficient priors.
    weight_loc: float = 0.0
    weight_scale: float = 1.0

    # Bias term prior.
    bias_loc: float = 0.0
    bias_scale: float = 1.0

    # Observation noise prior.
    sigma_scale: float = 10.0


def linear_regression_model(x: torch.Tensor, y: torch.Tensor = None, prior_params: PriorParams = None) -> torch.Tensor:
    """Implements a Bayesian model for linear regression that puts priors on the model parameters and observation noise.

    Parameters
    ----------
    x
        The input feature matrix of shape: (num_examples, num_features).
    y
        The response/target variable vector of size: num_examples.
    prior_params
        A 'dataclass' that stores the parameter values of the prior distributions used in the model.
        Note: If `prior_params` is None, then the default values from the class are used.

    Returns
    -------
    mean
        The regression line.
    """
    if prior_params is None:
        prior_params = PriorParams()

    # Regression coefficient priors.
    weight = pyro.sample(
        'weight', dist.Normal(prior_params.weight_loc, prior_params.weight_scale).expand([x.shape[1], 1]).to_event(2))

    # Bias term prior (aka the intercept or independent coefficient).
    bias = pyro.sample('bias', dist.Normal(prior_params.bias_loc, prior_params.bias_scale).expand([1]).to_event(1))

    # Observation noise prior (aka the random error in the response variable 'y').
    sigma = pyro.sample('sigma', dist.HalfNormal(scale=prior_params.sigma_scale))

    # Calculate the expected mean. This is a linear combination of the input features, weight matrix and bias vector.
    mean = torch.matmul(x, weight).squeeze(dim=-1) + bias

    # Plate context manager (vectorized).
    with pyro.plate('data', x.shape[0]):
        obs = pyro.sample('obs', dist.Normal(mean, sigma), obs=y)

    return mean


class BayesianRegressionModel:
    def __init__(self, random_state: int = 42):
        """Bayesian Regression Model.

        Attributes
        ----------
        random_state
            Seed used by the random number generator. This allows for repeatable sampling procedures.
        """

        # Set the seed used by the random number generator.
        self.random_state = random_state

        # Model evaluation metrics.
        self.training_losses: Optional[List[float]] = None

        # Pyro-specific attributes.
        self._model = None
        self._guide = None

    def fit(self,
            x_data: torch.Tensor,
            y_data: torch.Tensor,
            num_iterations: int = 1000,
            learning_rate: float = 0.001) -> None:
        """Fits the Bayesian Regression Model.

        Parameters
        ----------
        x_data
            Tensor of input features.
        y_data
            Tensor of observation values.
        num_iterations
            The number of iterations to run stochastic variational inference (SVI) for.
        learning_rate
            The learning rate used for optimizing the Evidence Lower Bound (ELBO) objective.

        Returns
        -------
        None
        """
        # Reset the random number generator.
        pyro.set_rng_seed(self.random_state)

        # Linear Regression Model.
        self._model = linear_regression_model

        # Pyro Guide.
        self._guide = pyro.infer.autoguide.AutoNormal(self._model, init_loc_fn=init_to_mean)

        # Variational Inference.
        pyro.clear_param_store()

        # ELBO.
        elbo = pyro.infer.Trace_ELBO()

        # Stochastic Variational Inference (SVI).
        optimizer = pyro.optim.Adam({'lr': learning_rate})
        svi = pyro.infer.SVI(self._model, self._guide, optimizer, elbo)

        # Run Stochastic Gradient Descent.
        self.training_losses = []
        for step in range(num_iterations):
            # Calculate the loss and take a gradient step.
            loss = svi.step(x_data, y_data)
            self.training_losses.append(loss)
            if step % 100 == 0:
                print(f'ELBO loss [iteration {step + 1}]: {loss:.4f}')

        # Examine the optimized guide parameter values from the trained guide by fetching from Pyro’s param store.
        self._guide.requires_grad_(False)
        for name, value in pyro.get_param_store().items():
            print(f'{name}: {pyro.param(name).data.cpu().numpy()}')

    def predict(self,
                x_data: torch.Tensor,
                num_samples: int = 800) -> Dict[Any, Dict[str, Union[Tensor, Any]]]:
        """Makes predictions for new observations.

        Parameters
        ----------
        x_data
            Tensor of input features.
        num_samples
            The number of samples to generate from the trained model for estimating the regression line and posterior
            predictive distribution.

        Returns
        -------
        pred_summary
            DataFrame including predictions from the fitted regression line (mean estimate) as well as estimated
            percentiles from the posterior predictive distribution.
        """
        # Reset the random number generator.
        pyro.set_rng_seed(self.random_state)

        # Generate Samples from the Posterior Predictive Distribution.
        predictive = pyro.infer.Predictive(
            self._model, guide=self._guide, num_samples=num_samples, return_sites=('obs', '_RETURN'), parallel=True)

        posterior_samples = predictive(x_data)
        pred_summary = self.summary(posterior_samples)

        return pred_summary

    @staticmethod
    def summary(samples) -> Dict[Any, Dict[str, Union[Tensor, Any]]]:
        site_stats = {}
        for k, v in samples.items():
            site_stats[k] = {
                "mean": torch.mean(v, 0),
                "std": torch.std(v, 0),
                "5%": v.kthvalue(int(len(v) * 0.05), dim=0)[0],
                "95%": v.kthvalue(int(len(v) * 0.95), dim=0)[0],
            }
        return site_stats


DATA_URL = "https://d2hg8soec8ck9v.cloudfront.net/datasets/rugged_data.csv"
data = pd.read_csv(DATA_URL, encoding="ISO-8859-1")
df = data[["cont_africa", "rugged", "rgdppc_2000"]]
df = df[np.isfinite(df.rgdppc_2000)]
df["rgdppc_2000"] = np.log(df["rgdppc_2000"])

df["cont_africa_x_rugged"] = df["cont_africa"] * df["rugged"]
data = torch.tensor(df[["cont_africa", "rugged", "cont_africa_x_rugged", "rgdppc_2000"]].values, dtype=torch.float)
x_data, y_data = data[:, :-1], data[:, -1]

blr = BayesianRegressionModel()
blr.fit(x_data, y_data)
y_pred = blr.predict(x_data)

Thank you! Above is my original implementation of linear_regression_model() without plates around all the sample statements. This version allows Predictive(parallel=True) to work, but it isn’t batching the other sample sites properly with psis_diagnostic().

this seems to work. though personally i haven’t found the psis diagnostic all that useful. for most problems where you want to use variational inference for scalability reasons the posterior approximation is mediocre at best and the psis diagnostic will be bad. if you really care about high quality posterior estimates you should do mcmc

import torch
import pyro
import pyro.distributions as dist
import numpy as np
import pandas as pd
from torch import Tensor

from dataclasses import dataclass
from typing import List, Dict, Optional, Union, Any

from pyro.infer.autoguide import init_to_mean

pyro.enable_validation(True)


@dataclass
class PriorParams:
    """Class for specifying the parameters of the prior distributions used in the linear regression model."""
    # Regression coefficient priors.
    weight_loc: float = 0.0
    weight_scale: float = 1.0

    # Bias term prior.
    bias_loc: float = 0.0
    bias_scale: float = 1.0

    # Observation noise prior.
    sigma_scale: float = 10.0


def linear_regression_model(x: torch.Tensor, y: torch.Tensor = None, prior_params: PriorParams = None) -> torch.Tensor:
    """Implements a Bayesian model for linear regression that puts priors on the model parameters and observation noise.

    Parameters
    ----------
    x
        The input feature matrix of shape: (num_examples, num_features).
    y
        The response/target variable vector of size: num_examples.
    prior_params
        A 'dataclass' that stores the parameter values of the prior distributions used in the model.
        Note: If `prior_params` is None, then the default values from the class are used.

    Returns
    -------
    mean
        The regression line.
    """
    if prior_params is None:
        prior_params = PriorParams()

    # Regression coefficient priors.
    weight = pyro.sample(
        'weight', dist.Normal(prior_params.weight_loc, prior_params.weight_scale).expand([x.shape[1]]).to_event(1))

    # Bias term prior (aka the intercept or independent coefficient).
    bias = pyro.sample('bias', dist.Normal(prior_params.bias_loc, prior_params.bias_scale).expand([1]).to_event(1))

    # Observation noise prior (aka the random error in the response variable 'y').
    sigma = pyro.sample('sigma', dist.HalfNormal(scale=prior_params.sigma_scale).expand([1]).to_event(1))

    # Calculate the expected mean. This is a linear combination of the input features, weight matrix and bias vector.
    mean = torch.matmul(x, weight.unsqueeze(-1)).squeeze(-1) + bias

    # Score all observations jointly: with .to_event(1) there are no plates left in
    # the model, so the only batch dimension is the particle dimension that
    # psis_diagnostic adds (hence max_plate_nesting=0 below).
    pyro.sample('obs', dist.Normal(mean, sigma).to_event(1), obs=y)

    return mean


class BayesianRegressionModel:
    def __init__(self, random_state: int = 42):
        """Bayesian Regression Model.

        Attributes
        ----------
        random_state
            Seed used by the random number generator. This allows for repeatable sampling procedures.
        """

        # Set the seed used by the random number generator.
        self.random_state = random_state

        # Model evaluation metrics.
        self.training_losses: Optional[List[float]] = None

        # Pyro-specific attributes.
        self._model = None
        self._guide = None

    def fit(self,
            x_data: torch.Tensor,
            y_data: torch.Tensor,
            num_iterations: int = 1000,
            learning_rate: float = 0.001) -> None:
        """Fits the Bayesian Regression Model.

        Parameters
        ----------
        x_data
            Tensor of input features.
        y_data
            Tensor of observation values.
        num_iterations
            The number of iterations to run stochastic variational inference (SVI) for.
        learning_rate
            The learning rate used for optimizing the Evidence Lower Bound (ELBO) objective.

        Returns
        -------
        None
        """
        # Reset the random number generator.
        pyro.set_rng_seed(self.random_state)

        # Linear Regression Model.
        self._model = linear_regression_model

        # Pyro Guide.
        self._guide = pyro.infer.autoguide.AutoNormal(self._model, init_loc_fn=init_to_mean)

        # Variational Inference.
        pyro.clear_param_store()

        # ELBO.
        elbo = pyro.infer.Trace_ELBO()

        # Stochastic Variational Inference (SVI).
        optimizer = pyro.optim.Adam({'lr': learning_rate})
        svi = pyro.infer.SVI(self._model, self._guide, optimizer, elbo)

        # Run Stochastic Gradient Descent.
        self.training_losses = []
        for step in range(num_iterations):
            # Calculate the loss and take a gradient step.
            loss = svi.step(x_data, y_data)
            self.training_losses.append(loss)
            if step % 100 == 0:
                print(f'ELBO loss [iteration {step + 1}]: {loss:.4f}')

        # Examine the optimized guide parameter values from the trained guide by fetching from Pyro’s param store.
        self._guide.requires_grad_(False)
        for name, value in pyro.get_param_store().items():
            print(f'{name}: {pyro.param(name).data.cpu().numpy()}')

    def predict(self,
                x_data: torch.Tensor,
                num_samples: int = 800) -> Dict[Any, Dict[str, Union[Tensor, Any]]]:
        """Makes predictions for new observations.

        Parameters
        ----------
        x_data
            Tensor of input features.
        num_samples
            The number of samples to generate from the trained model for estimating the regression line and posterior
            predictive distribution.

        Returns
        -------
        pred_summary
            DataFrame including predictions from the fitted regression line (mean estimate) as well as estimated
            percentiles from the posterior predictive distribution.
        """
        # Reset the random number generator.
        pyro.set_rng_seed(self.random_state)

        # Generate Samples from the Posterior Predictive Distribution.
        predictive = pyro.infer.Predictive(
            self._model, guide=self._guide, num_samples=num_samples, return_sites=('obs', '_RETURN'), parallel=True)

        posterior_samples = predictive(x_data)
        pred_summary = self.summary(posterior_samples)

        return pred_summary

    @staticmethod
    def summary(samples) -> Dict[Any, Dict[str, Union[Tensor, Any]]]:
        site_stats = {}
        for k, v in samples.items():
            site_stats[k] = {
                "mean": torch.mean(v, 0),
                "std": torch.std(v, 0),
                "5%": v.kthvalue(int(len(v) * 0.05), dim=0)[0],
                "95%": v.kthvalue(int(len(v) * 0.95), dim=0)[0],
            }
        return site_stats


DATA_URL = "https://d2hg8soec8ck9v.cloudfront.net/datasets/rugged_data.csv"
data = pd.read_csv(DATA_URL, encoding="ISO-8859-1")
df = data[["cont_africa", "rugged", "rgdppc_2000"]]
df = df[np.isfinite(df.rgdppc_2000)]
df["rgdppc_2000"] = np.log(df["rgdppc_2000"])

df["cont_africa_x_rugged"] = df["cont_africa"] * df["rugged"]
data = torch.tensor(df[["cont_africa", "rugged", "cont_africa_x_rugged", "rgdppc_2000"]].values, dtype=torch.float)
x_data, y_data = data[:, :-1], data[:, -1]

blr = BayesianRegressionModel()
blr.fit(x_data, y_data)
y_pred = blr.predict(x_data)

k_hat = pyro.infer.importance.psis_diagnostic(blr._model, blr._guide, x_data, y_data, num_particles=500000,
                                              max_plate_nesting=0)
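# rule of thumb from the psis literature: k_hat below ~0.5 is good, 0.5-0.7 is
# borderline, and above 0.7 suggests the variational approximation is unreliable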
print("k_hat", k_hat)

Thank you so much for all your help, I really appreciate it! Everything runs perfectly now. Looking at your example, I can see where I got things mixed up. Also, thank you for the advice on psis for VI vs. MCMC. I’ve been working through Statistical Rethinking and was eager to try out psis but I wasn’t aware of that difference. Thank you again!

Just wanted to say how much I enjoy using Pyro. It’s really amazing.

Thank you again for your help with this problem! One follow-up question I have is regarding the sampling of the observation noise parameter “sigma”. When I run the model with the same input feature vectors, the quantiles for the “mean” (i.e. the “_RETURN” site values) are identical. However, the quantiles for the outcome variable (i.e. the “obs” site values) are slightly different. My understanding is that the 800 samples drawn from the autoguide will instantiate 800 different linear regression functions. So it makes sense that passing the same input values through those 800 different functions would yield the same distribution of values for each regression line. But is there a way to control how the sigma values are sampled such that, for the same inputs, the distribution of the outcome variable is also identical?

do you mean sharing observation noise across different inputs or something?

More that the observation noise would be the same for the same inputs. For example, in the Bayesian Regression tutorial, two African countries with the same Log GDP and Terrain Ruggedness Index values would get the same "y_perc_5" and "y_perc_95" values from predictive = Predictive(model, guide=guide, num_samples=800, return_sites=("linear.weight", "obs", "_RETURN")).

from pyro.infer import Predictive


def summary(samples):
    site_stats = {}
    for k, v in samples.items():
        site_stats[k] = {
            "mean": torch.mean(v, 0),
            "std": torch.std(v, 0),
            "5%": v.kthvalue(int(len(v) * 0.05), dim=0)[0],
            "95%": v.kthvalue(int(len(v) * 0.95), dim=0)[0],
        }
    return site_stats


predictive = Predictive(model, guide=guide, num_samples=800,
                        return_sites=("linear.weight", "obs", "_RETURN"))
samples = predictive(x_data)
pred_summary = summary(samples)

mu = pred_summary["_RETURN"]
y = pred_summary["obs"]
predictions = pd.DataFrame({
    "cont_africa": x_data[:, 0],
    "rugged": x_data[:, 1],
    "mu_mean": mu["mean"],
    "mu_perc_5": mu["5%"],
    "mu_perc_95": mu["95%"],
    "y_mean": y["mean"],
    "y_perc_5": y["5%"],
    "y_perc_95": y["95%"],
    "true_gdp": y_data,
})

For example, if x_data[0, :] and x_data[1, :] were identical feature vectors, then "y_perc_5" and "y_perc_95" would be the same for each input. Currently, only the "mu_perc_5" and "mu_perc_95" values are the same given identical input feature vectors.

It’s more my lack of understanding about how the "num_samples" parameter in the Predictive class works. My assumption was that setting num_samples=800 would instantiate 800 different linear regression models (from 800 samples of the posterior distributions for the regression coefficients and bias). That seems to be the case, because the quantiles for the estimate of the regression line are the same for the same inputs (i.e. the 800 individual regression lines don’t change). I also assumed that each of the 800 samples from the posterior distribution for sigma would correspond to one of the 800 different linear regression models.

num_samples = 800
for i in range(0, num_samples):
    mean_i = (weight1[i] * x_1) + (weight2[i] * x_2) + bias[i]
    obs = pyro.sample('obs', dist.Normal(mean_i, sigma[i]))

Is the reason that the quantiles for the outcome variable ("obs") are different for the same inputs that new sampling occurs in the pyro.sample('obs', dist.Normal(mean_i, sigma[i])) statement for each new input?

yes the observation noise is sampled anew for each sample. if you want to share observation noise across different samples you will need to do that manually. e.g. use predictive to compute the means of each sample but then add observation noise with a term of the form sigma[i] * epsilon, where sigma[i] ranges across different posterior samples but epsilon is a unit normal random variable that is shared across data points. of course in practice you’d need to do that many times, e.g. epsilon[j] where j ranges across some additional outer loop of sampling. you can then compute percentiles and whatnot of the computed ys
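
in code, that recipe might look roughly like the sketch below; it assumes the guide exposes a 'sigma' site and that '_RETURN' holds the per-sample regression means, and names like num_eps are placeholders rather than anything from the thread:

import torch
from pyro.infer import Predictive

# draw posterior samples of the regression means and the noise scale
predictive = Predictive(model, guide=guide, num_samples=800,
                        return_sites=("sigma", "_RETURN"))
samples = predictive(x_data)

mu = samples["_RETURN"].reshape(800, -1)       # (800, num_examples): one regression line per posterior sample
sigma = samples["sigma"].reshape(800)          # (800,): one noise scale per posterior sample

num_eps = 50                                   # outer loop of shared-noise draws
eps = torch.randn(num_eps, 800, 1)             # eps[j, i] is shared across all data points
y_rep = mu + sigma.reshape(1, 800, 1) * eps    # (num_eps, 800, num_examples)
y_rep = y_rep.reshape(-1, mu.shape[-1])        # pool both sampling loops

# identical input rows now receive identical predictive quantiles
y_perc_5 = y_rep.kthvalue(int(len(y_rep) * 0.05), dim=0)[0]
y_perc_95 = y_rep.kthvalue(int(len(y_rep) * 0.95), dim=0)[0]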

Thank you so much! I greatly appreciate all your help. I will try that out.