How to define a likelihood function in numpyro?

I'm very new to numpyro and I've been trying to implement a simple linear regression with it, just to get an understanding of how it works. Here's a snippet of the code I've written, based on the documentation.

from jax import random
# jax numpy is imported as np NOT numpy
import jax.numpy as np
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

# Choose the "true" parameters.
m_true = -0.9594
b_true = 4.294
f_true = 1.0

# initializing parameters
N = 50; J = 1
X = random.normal(random.PRNGKey(seed=123), (N, J))
weight = np.array([m_true])
error = 0.1 * random.normal(random.PRNGKey(234), (N, ))
y_obs = f_true * (X @ weight + b_true) + error
y = y_obs.reshape((N, 1))

def model(X, y=None):
    ndims = np.shape(X)[-1]
    ws = numpyro.sample('betas', dist.Normal(0.0, 10.0*np.ones(ndims)))
    b = numpyro.sample('b', dist.Normal(0.0, 10.0))
    sigma = numpyro.sample('sigma', dist.Uniform(0.0, 10.0))
    f = numpyro.sample('f', dist.Normal(0.0, 10.0))
    mu = f * (X @ ws + b)
    return numpyro.sample("y", dist.Normal(mu, sigma), obs=y)

nuts_kernel = NUTS(model)
num_warmup, num_samples = 500, 1500
mcmc = MCMC(nuts_kernel, num_warmup=num_warmup, num_samples=num_samples, num_chains=1)
mcmc.run(random.PRNGKey(4), X, y=y_obs)
mcmc.print_summary()


The model is of the form y = f*(m*x + b). My question is: how is the likelihood function defined? The function model, I presume, defines random variables for all the model parameters. Does the return statement in the model correspond to the posterior probability distribution? How does numpyro know what the standard errors in my data are? How do I specify them?

There are people on this forum who can explain things better, but let me make a couple of points.

(1) In the model it should be mu = f * (X @ ws + b), I believe. Otherwise f is not used.
(2) If each independent observation is a scalar, then y should simply be a vector of shape (N,). There is an excellent tutorial on Pyro tensor shapes; numpyro follows the same logic.
(3) The likelihood of the model is not identified. Notice you can multiply f by a constant and divide ws and b by the same constant, and mu would be unaffected. If the likelihood is not identified, the priors in Bayesian models become extremely important, and their importance does not diminish as the sample size increases.
(4) All the return statement does in this model is return a draw from Normal(mu, sigma). That is, if you run numpyro.handlers.seed(model, random.PRNGKey(234))(X), the model will first sample all the latent sites (betas, b, sigma, f), then sample the site y, and return it to you (see the small sketch after these points).
(5) If you want to obtain the log-likelihood, you can do this:

ll = numpyro.infer.util.log_likelihood(
    model=model,
    posterior_samples={"betas": m_true * np.ones((1, 1)),
                       "b": np.array([b_true]),
                       "f": np.array([f_true]),
                       "sigma": np.array([0.1])},
    X=np.array(X),
    y=y_obs)
ll["y"].mean()

(6) Not sure what you mean by “standard errors”. It's a Bayesian model: you specify the scale (standard deviation) of the observations to be sigma and infer it. If you would like to allow for correlations between observations, you have to model them explicitly.
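To illustrate point (4), here is a minimal sketch (using your snippet above): seeding the model and calling it without y makes every sample statement draw from its prior, and the value returned is a simulated y.

seeded_model = numpyro.handlers.seed(model, random.PRNGKey(234))
y_sim = seeded_model(X)   # y defaults to None, so the "y" site is sampled rather than observed
print(y_sim.shape)        # (N,)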


Thanks! Some clarifications

(1) The model is mu = f * (X @ ws + b). I was trying to see if I obtain a solution with the f factor, so I must've missed adding that in the code snippet I sent here. I have edited the question appropriately.

(2) Each observation is a scalar for now. There is a variable J which represents the length of the weight vector; it's set to 2 for now, in order to get a simple case to work. The array X is specified to have shape (N, J). Thanks for the tutorial link, I'll go through it in greater detail shortly.

(3) I'm still not clear on this; could you please elaborate? Since the stepping in HMC depends on the likelihood function, how does numpyro arrive at the posterior? My crude understanding is that the user specifies the priors for all the parameters and the likelihood function, and the NUTS sampler uses them to obtain Markov chains that sample the parameter space. If the likelihood function isn't provided by the user, how is the NUTS sampler able to navigate the parameter space?

(6) Yes, when I said “standard error” I meant data variance. For now we assume that the covariance matrix is diagonal.

Thanks for the info regarding tensor shapes. Here's a modified version of the code that seems to work, but the question about the likelihood function still remains.

# Choose the "true" parameters.
m_true = -0.9594
b_true = 4.294
f_true = 0.534

# initializing parameters
N = 50; J = 2
X = random.normal(random.PRNGKey(seed=123), (N, J))
weight = np.array([m_true, 10 * m_true])
error = 1e-3 * random.normal(random.PRNGKey(234), (N,))  # noise with standard deviation 1e-3

y_obs = f_true * (X @ weight + b_true) + error

# setting up the model
def model(X, y=None):
    ndims = np.shape(X)[-1]
    ws = numpyro.sample('betas', dist.Normal(0.0, 10.0 * np.ones(ndims)))
    b = numpyro.sample('b', dist.Normal(0.0, 10.0))
    sigma = numpyro.sample('sigma', dist.Uniform(0.0, 10.0))
    f = numpyro.sample('f', dist.Normal(0.0, 2.5))
    mu = f * (X @ ws + b)
    return numpyro.sample("y", dist.Normal(mu, sigma), obs=y)

# setting up the sampler
nuts_kernel = NUTS(model)
num_warmup, num_samples = 500, 1500
mcmc = MCMC(nuts_kernel, num_warmup=num_warmup, num_samples=num_samples, num_chains=1)

# sampling
mcmc.run(random.PRNGKey(240), X, y=y_obs)

# printing the NUTS summary
mcmc.print_summary()
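As a quick sanity check (a minimal sketch, assuming the run above has finished), the posterior draws can be pulled out with mcmc.get_samples(). Note that because f, ws and b are only identified up to a common scale (see point (3) above), the marginal posterior means need not match the generating values.

samples = mcmc.get_samples()   # dict mapping site names to arrays of posterior draws
print({k: v.mean(axis=0) for k, v in samples.items()})
print("generating values:", {"betas": weight, "b": b_true, "f": f_true, "sigma": 1e-3})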

I should say, I am not a Numpyro dev, so treat my answers with a pinch of salt. Maybe @fehiepsi will want to correct my mistakes or add something.

(2) Right. Keep X as a 2-D array of shape (N, J) (so that the matrix multiplication works); I just meant that y should be a vector of shape (N,), i.e. one scalar per observation. Your code looks good now.

(3) Just to clarify, when I said the likelihood is not identified, I meant it in a mathematical sense (there exist multiple points in parameter space where the likelihood is identical). This has nothing to do with numpyro or MCMC.
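To make this concrete, here is a tiny numeric check (my own sketch, reusing the variables from your second snippet): rescaling f and compensating in the weights and intercept leaves mu, and therefore the likelihood, unchanged.

c = 3.0
mu1 = f_true * (X @ weight + b_true)
mu2 = (c * f_true) * (X @ (weight / c) + b_true / c)
print(np.allclose(mu1, mu2))   # True: the likelihood cannot tell these parameter settings apart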

You are right that a user provides priors, and at each iteration the algorithm evaluates them at the current values of the latent variables. But in your code you also provided the likelihood. The key is this statement: sample("y", dist.Normal(mu, sigma), obs=y). It tells the algorithm to evaluate the log-likelihood of observing obs=y given the latent variables set to the current iteration's values.
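If you want to see exactly what the sampler works with, numpyro.infer.util.log_density evaluates the joint log density log p(z) + log p(y | z) at a given set of latent values (a small sketch, assuming the variables from your second snippet are in scope):

from numpyro.infer.util import log_density

params = {"betas": weight,          # the "true" weights used to simulate the data
          "b": np.array(b_true),
          "f": np.array(f_true),
          "sigma": np.array(1e-3)}
log_joint, _ = log_density(model, (X,), {"y": y_obs}, params)
print(log_joint)   # prior log density of params plus log-likelihood of y_obs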

An alternative (and perhaps more explicit) way to add to the log-likelihood is to use factor. For example, for the purpose of inference, sample("y", dist.Normal(mu, sigma), obs=y) could be replaced with factor("y", dist.Normal(mu, sigma).log_prob(y)). This again means: create an instance of dist.Normal with parameters mu and sigma, and use its log_prob method to compute the log-likelihood of seeing y.

If you wanted to, you could code the log-likelihood of the Normal distribution by hand and use it inside factor. It is already implemented as a method of dist.Normal, so there is no need to do that here, but in some models it may be necessary if a distribution is not yet implemented.
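For completeness, here is a minimal sketch of what that would look like (my own variant, not required for your problem): the same model with the observation statement replaced by factor and a hand-coded Gaussian log-likelihood.

def model_factor(X, y):
    ndims = np.shape(X)[-1]
    ws = numpyro.sample('betas', dist.Normal(0.0, 10.0 * np.ones(ndims)))
    b = numpyro.sample('b', dist.Normal(0.0, 10.0))
    sigma = numpyro.sample('sigma', dist.Uniform(0.0, 10.0))
    f = numpyro.sample('f', dist.Normal(0.0, 2.5))
    mu = f * (X @ ws + b)
    # hand-coded Normal log-likelihood, summed over the N observations;
    # equivalent to numpyro.factor("y", dist.Normal(mu, sigma).log_prob(y))
    log_lik = (-0.5 * np.sum(((y - mu) / sigma) ** 2)
               - y.shape[0] * (np.log(sigma) + 0.5 * np.log(2 * np.pi)))
    numpyro.factor("y", log_lik)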

(6) The way you model the dependence is up to you. Suppose you have L measurements for each of N units (so Y.shape == (N, L)); you may then want to replace Normal with MultivariateNormal and the scalar sigma with a covariance matrix Sigma. Make sure to review the tutorial on shapes, though: in this example you would want to ensure that batch_shape is (N,) and event_shape is (L,).
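Here is a minimal sketch of that multivariate variant (with made-up priors and site names such as sd and corr_chol, and L chosen arbitrarily); the key point is that loc has shape (N, L) and scale_tril has shape (L, L), giving batch_shape (N,) and event_shape (L,).

def mv_model(X, Y=None, L=3):
    N, J = X.shape
    ws = numpyro.sample("betas", dist.Normal(0.0, 10.0 * np.ones(J)))
    b = numpyro.sample("b", dist.Normal(0.0, 10.0))
    f = numpyro.sample("f", dist.Normal(0.0, 2.5))
    # covariance of the L measurements of one unit: per-dimension scales
    # times an LKJ-distributed correlation Cholesky factor
    sd = numpyro.sample("sd", dist.HalfNormal(np.ones(L)))
    corr_chol = numpyro.sample("corr_chol", dist.LKJCholesky(L, concentration=1.0))
    scale_tril = sd[:, None] * corr_chol
    mu = f * (X @ ws + b)                        # shape (N,)
    mu = np.broadcast_to(mu[:, None], (N, L))    # same mean for each of the L measurements
    return numpyro.sample("Y", dist.MultivariateNormal(mu, scale_tril=scale_tril), obs=Y)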

I hope this clarifies it.


Thanks, @Elchorro, for your detailed answer! I just have a few points to elaborate:

  • The function model defines a generative model for your observed data. It corresponds 1-1 to a graphical model. The likelihood function is encoded in your observed sites: the numpyro.sample sites with obs. It corresponds to the grey circle node in a graphical model.
  • You can return whatever you want because model is just a Python function. E.g. you can return ws, b, sigma.
  • Given a generative model p(z, x) (where z is a latent variable and x is an observed variable), HMC generates samples from p(z|x) = p(z, x) / p(x) ∝ p(z, x) = p(z) p(x|z). Here the prior p(z) is computed from the sample statements without obs, and the likelihood p(x|z) is computed from the sample statements with obs. In your example, the likelihood is np.exp(dist.Normal(mu, sigma).log_prob(y)).
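One way to see this decomposition is to trace a single execution of the model (a small sketch): latent sites contribute the prior p(z), observed sites contribute the likelihood p(x|z).

from numpyro import handlers

exec_trace = handlers.trace(handlers.seed(model, random.PRNGKey(0))).get_trace(X, y=y_obs)
for name, site in exec_trace.items():
    if site["type"] == "sample":
        print(name, "observed" if site["is_observed"] else "latent")
# betas, b, sigma, f -> latent (prior);  y -> observed (likelihood)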

Thank you @Elchorro and @fehiepsi ! I am beginning to get some clarity.

@fehiepsi, while I understand that model is just a Python function and can return anything, the reason I asked is that model is used to build the NUTS kernel, so I was wondering whether something specific needs to be returned. Does the kernel look at all the sample statements and decide whether a given statement is a prior based on the obs keyword?

If I want to define the likelihood function to be $e^{-(\text{model} - \text{data})^2/\text{variance}}$, do I simply change the sample call to numpyro.sample("y", dist.Normal(mu, 1e-6*np.ones_like(y)), obs=y)? (In the synthetic data used for the problem, the noise is taken to have a standard deviation of 1e-3 for all data points.)

You are right, up to a normalizing constant: with dist.Normal(mu, 1e-6*np.ones_like(y)), the likelihood is proportional to exp(-(mu - y)^2 / (2 * (1e-6)^2)). Note that the scale argument of dist.Normal is the standard deviation, so to match data simulated with noise standard deviation 1e-3 you would pass 1e-3.
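As a quick check (a small sketch), dist.Normal's log_prob is the standard Gaussian log density, so each observed data point contributes -(mu - y)^2 / (2 * sigma^2) - log(sigma * sqrt(2 * pi)) to the log-likelihood:

mu0, sigma0, y0 = 0.0, 1e-6, 2e-6
lp = dist.Normal(mu0, sigma0).log_prob(y0)
manual = -(y0 - mu0) ** 2 / (2 * sigma0 ** 2) - np.log(sigma0 * np.sqrt(2 * np.pi))
print(lp, manual)   # the two values agree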

Great. Thanks a ton @fehiepsi and @Elchorro!

It’s very encouraging to see both developers and users being so responsive and helpful. Definitely nudges me to continue using pyro and hopefully contribute to the codebase one day.