Trouble with basic parameter estimation with the Stable distribution

I’m trying to make use of Pyro’s implementation of Stable distributions. As a first sanity check I’m trying to recover a set of known parameters via MLE, but I can’t get it to find the correct values. Here is a minimal example:

import torch
import matplotlib.pyplot as plt

import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.reparam import MinimalReparam

device = "cuda"

# Define true parameters and number of datapoints
alpha = 1.4
beta = 0.8
c = 1.1
mu = 3
n = 1000000

# sample data
with torch.device(device):
    data = dist.Stable(alpha, beta, c, mu).sample((n,))

pyro.clear_param_store()

# define a simple model
@MinimalReparam()
def simple_model(data):    
    alpha = pyro.param("alpha", torch.tensor(1.3), constraint=constraints.interval(0, 2))
    beta = pyro.param("beta", torch.tensor(0.7), constraint=constraints.interval(-1, 1))
    c = pyro.param("c", torch.tensor(1.0), constraint=constraints.positive)
    mu = pyro.param("mu", torch.tensor(2.9), constraint=constraints.real)
    
    with pyro.plate("data", data.shape[0]):
        pyro.sample("obs", dist.Stable(alpha, beta, c, mu), obs=data)

# set up Autoguide, ELBO, and optimizer
with torch.device(device):
    guide = pyro.infer.autoguide.AutoDelta(simple_model)
    elbo = Trace_ELBO()
    elbo.loss(simple_model, guide, data=data)

num_steps = 10001
with torch.device(device):
    optim = pyro.optim.ClippedAdam({"lr": 0.01})
    scheduler = pyro.optim.CosineAnnealingLR({"optimizer": optim, "optim_args": {"lr": 0.01, "T_max": num_steps}})
    svi = SVI(simple_model, guide, optim, loss=elbo)

# optimize
losses = []
for i in range(num_steps):
    loss = svi.step(data)
    losses.append(loss)

print(f"Parameter estimates (n = {n}):")
print(f"alpha: Estimate = {pyro.param('alpha')}, true = {alpha}")
print(f"beta: Estimate = {pyro.param('beta')}, true = {beta}")
print(f"c: Estimate = {pyro.param('c')}, true = {c}")
print(f"mu: Estimate = {pyro.param('mu')}, true = {mu}")

Results:

Parameter estimates (n = 1000000):
alpha: Estimate = 0.3840053975582123, true = 1.4
beta: Estimate = 0.6137597560882568, true = 0.8
c: Estimate = 1.2401551008224487, true = 1.1
mu: Estimate = 2.395522117614746, true = 3

With 1M datapoints, SVI converges to a parameter set significantly worse than the initial guesses I gave to the pyro.param() calls. What am I getting wrong here? Am I messing something up with the reparameterizer?

We’ve had some difficulty getting the Stable distributions to work reliably, so your mileage may vary. @fritzo may have more detailed insight.

Stable inference is very difficult for small alpha, including not only inference when alpha is known to be small, but also inference where it is unknown whether alpha is small. I’ve had the best luck using very conservative lower bounds for my alpha parametrization, e.g. if I know alpha is around 1.8, then I might constrain to >= 1.7. I’d also recommend initializing to an overestimate like 1.9, 1.95, or 1.99.

alpha = pyro.param("alpha", torch.tensor(1.95),  # Initialize to an over-estimate
                   constraint=constraints.interval(1.7, 2.0))  # Conservative lower bound

In your case you might use a lower bound of 1.0 or even 1.1 and initialize to 1.9.
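
For instance, just adapting the pyro.param line above (the exact bounds are a judgment call, not a prescription):

alpha = pyro.param("alpha", torch.tensor(1.9),                 # initialize to an over-estimate
                   constraint=constraints.interval(1.1, 2.0))  # conservative lower bound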

More generally, gradient-based optimization is great in that it applies to so many problems, but it is also near-sighted and greedy, so it really helps to give it good initialization and guard rails.

Also, looking at your optimizer: using ClippedAdam with the default clip_norm=10.0 may be a bad idea in your case, since you have 10^6 datapoints. You probably want a much less aggressive level of clipping, e.g. clip_norm=1.0e3 or even higher.
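
For example, something along these lines (the exact value depends on your data size):

optim = pyro.optim.ClippedAdam({"lr": 0.01, "clip_norm": 1.0e3})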

Thanks for the tip about scaling the ClippedAdam clip_norm; I hadn’t thought about needing to scale the max grad norm with the number of datapoints.

I tried rerunning this little MLE problem in the upper region of the alpha space as suggested, but unfortunately I’m still getting similar behavior. Setting the true value to 1.8, the initial guess to 1.95, and the constraint bound to (1.7, 2.0), the converged value is generally coming out at the bottom of the constraint region, i.e. around 1.7.

Updated code snippet:

import torch
import matplotlib.pyplot as plt

import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.reparam import MinimalReparam

device = "cuda"

# Define true parameters and number of datapoints
alpha = 1.8
beta = 0.8
c = 1.1
mu = 3
n = int(1e6)

# Sample data
with torch.device(device):
    data = dist.Stable(alpha, beta, c, mu).sample((n,))

pyro.clear_param_store()

# define the simple model
@MinimalReparam()
def simple_model(data):    
    alpha = pyro.param("alpha", torch.tensor(1.95), constraint=constraints.interval(1.7, 2.))
    beta = pyro.param("beta", torch.tensor(0.7), constraint=constraints.interval(-1., 1.))
    c = pyro.param("c", torch.tensor(1.0), constraint=constraints.positive)
    mu = pyro.param("mu", torch.tensor(2.9), constraint=constraints.real)
    
    with pyro.plate("data", data.shape[0]):
        pyro.sample("obs", dist.Stable(alpha, beta, c, mu), obs=data)

# Set up autoguide, ELBO, and optimizer
with torch.device(device):
    guide = pyro.infer.autoguide.AutoDelta(simple_model)

with torch.device(device):
    elbo = Trace_ELBO()
    elbo.loss(simple_model, guide, data=data)

num_steps = 5001
with torch.device(device):
    optim = pyro.optim.ClippedAdam({"lr": 0.01, "clip_norm": 1e4})
    scheduler = pyro.optim.CosineAnnealingLR({"optimizer": optim, "optim_args": {"lr": 0.01, "T_max": num_steps, "clip_norm": 1e4}})
    svi = SVI(simple_model, guide, optim, loss=elbo)

# Optimize
losses = []

for i in range(num_steps):
    loss = svi.step(data)
    losses.append(loss)

print(f"Parameter estimates (n = {n}):")
print(f"alpha: Estimate = {pyro.param('alpha')}, true = {alpha}")
print(f"beta: Estimate = {pyro.param('beta')}, true = {beta}")
print(f"c: Estimate = {pyro.param('c')}, true = {c}")
print(f"mu: Estimate = {pyro.param('mu')}, true = {mu}")

Sample results:

Parameter estimates (n = 1000000):
alpha: Estimate = 1.7137782573699951, true = 1.8
beta: Estimate = 0.9999985694885254, true = 0.8
c: Estimate = 1.202109932899475, true = 1.1
mu: Estimate = 2.61881947517395, true = 3

Hmm, a beta of 0.8 is pretty skewed, and it looks like the estimator thinks your data is totally skewed. I guess I’d try (1) initializing beta = 0.0, (2) trying double precision if you aren’t already, via torch.set_default_dtype(torch.float64), and (3) experimenting with the behavior as the sample count increases, from say 1e2 up to 1e7 or 1e8.
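
For example, (1) and (2) amount to something like this (a sketch, with the rest of the script unchanged):

import torch
import pyro
from pyro.distributions import constraints

torch.set_default_dtype(torch.float64)  # (2) run everything in double precision

# (1) inside simple_model, start beta from the symmetric case instead of 0.7
beta = pyro.param("beta", torch.tensor(0.0),
                  constraint=constraints.interval(-1.0, 1.0))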

I find that my own intuition was built on decades of Gaussian education and has trouble estimating how many samples are needed to estimate heavy-tailed parameters: you need a lot more data than the usual 1/\epsilon^2 of the Gaussian case.

I don’t think it’s a preasymptotic sample size issue, because scipy’s implementation of Stable distribution fitting is able to produce better estimates with far fewer samples (though it takes much longer, of course). With sample size 1e4:

Pyro output (same program from before):

Parameter estimates (n = 10000):
alpha: Estimate = 1.7075034379959106, true = 1.8
beta: Estimate = 0.9999992847442627, true = 0.8
c: Estimate = 1.0035901069641113, true = 1.1
mu: Estimate = 2.471940279006958, true = 3

Scipy:

from scipy.stats import levy_stable
import time

start = time.time()
params = levy_stable.fit(data.cpu().numpy())
stop = time.time()
print(params)
print([alpha, beta, mu, c])
print(f"Time elapsed: {stop - start} s")

Output:

(1.8070363609221514, 0.8569286711710157, 3.2703706927729344, 1.1080908182649822)
[1.8, 0.8, 3, 1.1]
Time elapsed: 1755.4336349964142 s

Right now I’m working on implementing direct PDF evaluation via numerical integration with the Torchquad library. We’ll see how well backpropping through the integration works.


Oh, I think I see the issue; let me explain. Pyro’s Stable inference uses auxiliary variable methods via poutine.reparam (your MinimalReparam above). And since the auxiliary variables must be integrated out to compute a marginal density of the data, Pyro doesn’t really support exact ML estimation of parameters. However, Pyro does support other types of inference of Stable parameters:

  1. You can compute approximate ML estimates of the parameters using variational inference and a nontrivial guide, e.g. using AutoNormal instead of the AutoDelta above.
  2. You can Monte Carlo estimate a posterior distribution over parameters (given a flat prior) using HMC or NUTS; these may be slow but are asymptotically correct.
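
For option 2, a rough sketch might look like the following, reusing the data tensor from above (the priors, subsample size, and step counts here are illustrative, not recommendations from this thread):

import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS
from pyro.infer.reparam import MinimalReparam

# Replace the pyro.param statements with latent sites under broad priors and
# run NUTS on the reparameterized model.
@MinimalReparam()
def bayes_model(data):
    alpha = pyro.sample("alpha", dist.Uniform(1.1, 2.0))  # conservative lower bound
    beta = pyro.sample("beta", dist.Uniform(-1.0, 1.0))
    c = pyro.sample("c", dist.LogNormal(0.0, 1.0))         # weakly informative scale prior
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))         # wide location prior
    with pyro.plate("data", data.shape[0]):
        pyro.sample("obs", dist.Stable(alpha, beta, c, mu), obs=data)

mcmc = MCMC(NUTS(bayes_model), num_samples=500, warmup_steps=500)
mcmc.run(data[:10000])  # this will still be slow; subsampling the data helps
mcmc.summary()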

Note that originally you were using an AutoDelta for all latent variables, including the automatically introduced auxiliary variables (introduced here). So your weird estimates aren’t ML estimates of your original model, but instead ML estimates of the extended model. What you want in auxiliary variable methods is to maximize over your parameters while integrating out the auxiliary latent variables, but the AutoDelta instead point-estimates everything at once. By replacing the AutoDelta with an AutoNormal, you can at least approximately integrate out those auxiliary latents.
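
Concretely, the swap only touches the guide; a minimal sketch, reusing simple_model, data, num_steps, and the optimizer settings from your snippet above:

# Only the guide changes: AutoNormal fits a (transformed) normal to each auxiliary
# latent introduced by the reparameterizer, approximately integrating it out,
# while the pyro.param statements are still point estimates.
pyro.clear_param_store()
guide = pyro.infer.autoguide.AutoNormal(simple_model)
svi = SVI(simple_model, guide, pyro.optim.ClippedAdam({"lr": 0.01, "clip_norm": 1e4}),
          loss=Trace_ELBO())
for i in range(num_steps):
    svi.step(data)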

Does that make sense? And thanks for trying out our Stable machinery. I’m a big fan of Stable distributions, and I’d like more people to use them, so we welcome any improvements to code or docs that you can suggest or contribute. Cheers.

Thanks for the explanation. I had seen the code in StableReparam you linked, but don’t quite understand the method implemented there, or why an rsampled version of a single Chambers-Mallows-Stuck random variable (instead of two) wouldn’t work.

Interestingly, swapping out the AutoDelta for an AutoNormal still gives a wrong estimate, in this case putting beta at 0.

Parameter estimates (n = 1000000):
alpha: Estimate = 1.8298276662826538, true = 1.8
beta: Estimate = -0.0001965165138244629, true = 0.8
c: Estimate = 1.1153184175491333, true = 1.1
mu: Estimate = 3.145782470703125, true = 3

Is this because AutoNormal puts a Gaussian prior on beta \in [-1, 1] so the prior is concentrated around 0?

Hmm, how would you use just a distribution’s rsample method to fit its parameters to data, even for say Normal(loc,scale).rsample() to normal data?

Well, AutoNormal doesn’t put a prior on anything. Rather, it fits a univariate variational posterior to each latent variable. In your case there are, I think, 4 latent variables (per data point), of which 2 are exponential and 2 are uniform. AutoNormal will fit lognormal distributions to the exponentials (i.e. a normal with an exp() link function) and logit-normal distributions to the uniforms (i.e. with a sigmoid link function).
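
To illustrate what those link functions amount to (this is just the equivalent construction, not Pyro’s internal code):

import torch
import pyro.distributions as dist
from torch.distributions.transforms import ExpTransform, SigmoidTransform

base = dist.Normal(torch.tensor(0.0), torch.tensor(1.0))
# A normal pushed through exp() is a lognormal, matching the positive support
# of the Exponential auxiliary latents.
positive_posterior = dist.TransformedDistribution(base, [ExpTransform()])
# A normal pushed through sigmoid() is a logit-normal on (0, 1), matching the
# Uniform auxiliary latents (up to an affine rescaling onto their actual interval).
unit_interval_posterior = dist.TransformedDistribution(base, [SigmoidTransform()])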

I don’t know why the beta is off in Pyro’s estimate. I thought the most likely culprit might be the variational approximation, but even a multivariate normal guide doesn’t seem to work, again resulting in beta near zero. I suppose there could be a bug or numerical instability in beta, as I’ve mostly tested symmetric stable distributions which use a simpler reparametrization with 2 auxiliary variables. LMK if you have any other ideas :person_shrugging:

I dunno, I just meant that I had this vague idea that pathwise derivatives and the reparameterization trick solved everything, at least before I started this deep dive into the stable distribution here.

Regarding other ideas, I’m still working on seeing whether fitting the distribution by computing the PDF directly in a log_prob() method is practical. I’ll post an update when I have something.


I spent a couple of hours playing with @fritzo’s notebook and found that we can’t estimate beta properly even when fixing the other parameters alpha, c, mu. I suspect that there’s something wrong with StableReparam. It would be helpful to create a GitHub issue to track this problem.
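
For anyone who wants to try reproducing this, a rough sketch of that kind of check (my paraphrase, not the actual notebook) is to fix alpha, c, mu at the true values and fit only beta:

import torch
import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.reparam import MinimalReparam

true_alpha, true_beta, true_c, true_mu = 1.8, 0.8, 1.1, 3.0
data = dist.Stable(true_alpha, true_beta, true_c, true_mu).sample((100000,))

pyro.clear_param_store()

@MinimalReparam()
def beta_only_model(data):
    # every parameter except beta is fixed at its true value
    beta = pyro.param("beta", torch.tensor(0.0),
                      constraint=constraints.interval(-1.0, 1.0))
    with pyro.plate("data", data.shape[0]):
        pyro.sample("obs", dist.Stable(true_alpha, beta, true_c, true_mu), obs=data)

guide = pyro.infer.autoguide.AutoNormal(beta_only_model)
svi = SVI(beta_only_model, guide,
          pyro.optim.ClippedAdam({"lr": 0.01, "clip_norm": 1e4}), loss=Trace_ELBO())
for i in range(2000):
    svi.step(data)
print(pyro.param("beta"))  # reportedly does not recover 0.8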


@fehiepsi I wonder if PyTorch 2.0 broke something? @eb8680_2 found that a number of Stable reparam tests have broken since PyTorch 2: Some distribution tests fail under PyTorch 2.0 · Issue #3214 · pyro-ppl/pyro · GitHub

I’ve implemented a log_prob() for Stable distributions. It uses the Torchquad library to compute the integral formulas given by Nolan. I uploaded the code to this repo with a demo notebook here.

The demo shows a similar MLE example that is optimized by backpropping through the log-likelihood calculation. I haven’t tried putting it in a fully fleshed-out Pyro program or deep network yet.

The integral calculation and the backward pass are pretty fast and seem to be numerically stable, but it’s possible there are still kinks to work out, and the code can likely be further optimized.
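
For readers who want a flavor of the approach, here is a toy sketch of backpropping through a torchquad quadrature. It is not the code from the linked repo: it inverts the characteristic function directly (S1 parameterization, alpha != 1, truncated upper limit) rather than using Nolan’s integrand, and the truncation and point counts are illustrative.

import torch
from torchquad import Simpson  # may need torchquad.set_up_backend("torch") depending on version

def stable_pdf_cf(x, alpha, beta, c, mu, upper=60.0, N=999):
    """Toy stable density via f(x) = (1/pi) * Re int_0^inf exp(-i t x) phi(t) dt,
    truncated at `upper` and computed with Simpson's rule. This uses the S1
    parameterization; Pyro's Stable uses Nolan's S0, so the location parameter
    differs slightly when beta != 0."""
    def integrand(t):
        t = t[:, 0]  # torchquad passes quadrature points of shape (N, dim)
        phi = torch.exp(
            1j * mu * t
            - (c * t) ** alpha * (1.0 - 1j * beta * torch.tan(torch.pi * alpha / 2))
        )
        return torch.real(torch.exp(-1j * t * x) * phi)
    # start slightly above 0 so the gradient of t**alpha stays finite at t = 0
    return Simpson().integrate(integrand, dim=1, N=N,
                               integration_domain=[[1e-6, upper]]) / torch.pi

# Gradients flow through the quadrature into the distribution parameters:
alpha = torch.tensor(1.8, requires_grad=True)
beta = torch.tensor(0.8, requires_grad=True)
p = stable_pdf_cf(torch.tensor(2.5), alpha, beta, torch.tensor(1.1), torch.tensor(3.0))
torch.log(p).backward()
print(p.item(), alpha.grad, beta.grad)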
