Developing a deeper intuition for HMC: connecting the intuition with Hamilton’s equations

I’m trying to develop a deeper understanding of HMC and how HMC sampling in Pyro works, but it has proven difficult even with the two main sources out there (Neal, Betancourt). Most other tutorials are just copies of these two resources without adding any extra intuition. I believe many of the questions I ask below are common among HMC beginners. I also believe that someone with a thorough understanding of this algorithm will be able to follow my line of reasoning and quickly identify where it goes wrong. Hopefully, the answers to my questions will aid other practitioners starting out with HMC.

I will start by describing the algorithm and some intuitions, then present two physical analogies, and finally pose my questions.

Algorithm and Intuition

HMC is typically used to sample from the posterior distribution of parameters \mathbf{x} given data \mathbf{D}. According to Bayes’ theorem, the posterior is:

p(\mathbf{x} \mid \mathbf{D}) = \frac{p(\mathbf{D} \mid \mathbf{x}) p(\mathbf{x})}{p(\mathbf{D})},

where:

  • p(\mathbf{D} \mid \mathbf{x}) is the likelihood of the data,
  • p(\mathbf{x}) is the prior,
  • p(\mathbf{D}) = \int p(\mathbf{D} \mid \mathbf{x}) p(\mathbf{x}) \, d\mathbf{x} = E_{p(\mathbf{x})}[p(\mathbf{D} \mid \mathbf{x})] is the marginal likelihood (normalization constant), the expectation of the likelihood under the prior.

We aim to sample from p(\mathbf{x} \mid \mathbf{D}). I like to state the normalization constants explicitly. Betancourt’s focus is on expectations, likely because almost any analysis we want to conduct can be operationalized as the computation of an expectation. Related to this is the typical set, the region where the probability mass p(x) \, dx concentrates. This is an abstract concept that can be hard to grasp. The main idea is that as the dimension increases, the volume element dx grows so quickly away from the mode that p(x) \, dx ends up dominated by regions outside the neighborhood of the mode, even though the density p(x) peaks at the mode.
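
A quick numerical sketch makes this concrete (plain NumPy; variable names are my own): in a d-dimensional standard Gaussian, samples concentrate at distance roughly \sqrt{d} from the mode, nowhere near the point of highest density.

```python
import numpy as np

# In a d-dimensional standard Gaussian, the density peaks at the origin,
# but nearly all samples land in a thin shell at radius ~ sqrt(d):
# the typical set, where p(x) dx is largest.
rng = np.random.default_rng(0)
for d in [1, 10, 100, 1000]:
    x = rng.standard_normal((10_000, d))
    r = np.linalg.norm(x, axis=1)  # distance of each sample from the mode
    print(f"d={d:4d}  mean radius={r.mean():6.2f}  sqrt(d)={np.sqrt(d):6.2f}  std={r.std():.2f}")
```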

The idea behind HMC is to construct a vector field that aligns with this typical set. To do this, we utilize the gradients of our target distribution. We then need momentum to move around this landscape so that we don’t crash into the mode (since gradients point toward the mode). We also need to carefully add this momentum to stay within the typical set without drifting away into regions of low p(x) \, dx .

In HMC, the idea is to define a Hamiltonian and its corresponding equations for how the system evolves. The Hamiltonian is given by:

H(\mathbf{x}, \mathbf{p}) = U(\mathbf{x}) + K(\mathbf{p}),

where U(\mathbf{x}) = -\log p(\mathbf{x} \mid \mathbf{D}) is the potential energy, and K(\mathbf{p}) = \frac{1}{2} \mathbf{p}^\top M^{-1} \mathbf{p} is the kinetic energy, with momentum \mathbf{p} and mass matrix M.

Hamilton’s equations are:

\frac{d\mathbf{x}}{dt} = \frac{\partial H}{\partial \mathbf{p}}, \quad \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{x}}.

Explicitly, these become:

\frac{d\mathbf{x}}{dt} = M^{-1} \mathbf{p}, \quad \frac{d\mathbf{p}}{dt} = -\nabla U(\mathbf{x}),

where \nabla U(\mathbf{x}) is the gradient of the potential energy.

The Hamiltonian represents the total energy in a system, and its equations describe a system that preserves energy. If the kinetic energy increases, the potential energy decreases, and vice versa.
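
For a 1D standard Gaussian target with M = 1, Hamilton’s equations can actually be solved exactly (the trajectory is a rotation in phase space), which lets us check the conservation claim directly. A minimal sketch, with my own variable names:

```python
import numpy as np

# 1D standard Gaussian target: U(x) = x^2/2, K(p) = p^2/2, M = 1.
# Hamilton's equations dx/dt = p, dp/dt = -x have the exact solution
# (x(t), p(t)) = (x0 cos t + p0 sin t, -x0 sin t + p0 cos t).
H = lambda x, p: 0.5 * x**2 + 0.5 * p**2

x0, p0 = 2.0, 0.5
for t in np.linspace(0.0, 2 * np.pi, 5):
    x = x0 * np.cos(t) + p0 * np.sin(t)
    p = -x0 * np.sin(t) + p0 * np.cos(t)
    print(f"t={t:5.2f}  x={x:6.3f}  p={p:6.3f}  H={H(x, p):.6f}")  # H is constant along the trajectory
```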

Canonical Distribution

Another important distribution is the canonical distribution, which gives the probability of a state as a function of its energy. It is given by:

p(x) = \frac{1}{Z} e^{-E(x)/T},

where E(x) is the energy and T is the temperature. Substituting the Hamiltonian for the energy (and setting T = 1), the canonical distribution over phase space becomes:

p(\mathbf{x}, \mathbf{p}) = \frac{1}{Z} e^{-U(\mathbf{x})} e^{-K(\mathbf{p})}.

Key points to note:

  1. Hamiltonian trajectories preserve energy, so the probability density in the canonical distribution remains constant along a trajectory.
  2. The density factorizes into a U(\mathbf{x}) term and a K(\mathbf{p}) term, so position and momentum are independent under it.
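
This factorization is what justifies step 3 of the process below (discarding the momentum): marginalizing the joint canonical density over \mathbf{p} leaves exactly the target. A small numerical check of this, assuming a 1D standard Gaussian target for concreteness:

```python
import numpy as np

# Joint canonical density (T = 1): p(x, p) ∝ exp(-U(x)) * exp(-K(p)).
# Because it factorizes, summing out p leaves exp(-U(x)), i.e. the target.
xs = np.linspace(-5, 5, 401)
ps = np.linspace(-5, 5, 401)
U = 0.5 * xs**2  # 1D standard Gaussian target, used here as an example
K = 0.5 * ps**2
joint = np.exp(-U)[:, None] * np.exp(-K)[None, :]

marginal = joint.sum(axis=1)            # discard the momentum dimension
marginal /= marginal.sum()
target = np.exp(-U) / np.exp(-U).sum()
print(np.abs(marginal - target).max())  # ~1e-17: the marginal is the target
```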

Each HMC iteration then consists of:

  1. Sampling a momentum from a normal distribution.
  2. Simulating Hamilton’s equations for a while in phase space.
  3. Discarding the momentum and keeping the position as a sample from the target distribution.

Since Hamilton’s equations can rarely be solved analytically, symplectic integrators (such as leapfrog) are used. These introduce some discretization error, but they preserve phase-space volume, and their energy error stays bounded rather than accumulating, so trajectories do not drift far from the energy level. A Metropolis-Hastings step is then used to correct for the remaining error.
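
To make the whole loop concrete, here is a minimal, self-contained HMC transition in plain NumPy (my own naming, not Pyro’s actual implementation): resample the momentum, run the leapfrog integrator, and apply the MH correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(x, U, grad_U, step_size=0.1, n_steps=20):
    """One HMC transition: resample momentum, leapfrog, MH accept/reject."""
    p = rng.standard_normal(x.shape)           # 1. sample momentum (M = I)
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * step_size * grad_U(x_new)   # 2. leapfrog: half step in p
    for _ in range(n_steps - 1):
        x_new += step_size * p_new             #    full step in x
        p_new -= step_size * grad_U(x_new)     #    full step in p
    x_new += step_size * p_new
    p_new -= 0.5 * step_size * grad_U(x_new)   #    final half step in p
    H_old = U(x) + 0.5 * p @ p
    H_new = U(x_new) + 0.5 * p_new @ p_new
    if rng.uniform() < np.exp(H_old - H_new):  # 3. MH correction for integration error
        return x_new
    return x

# Example: sample a 2D standard Gaussian, U(x) = ||x||^2 / 2.
U = lambda x: 0.5 * x @ x
grad_U = lambda x: x
x, samples = np.zeros(2), []
for _ in range(2000):
    x = hmc_step(x, U, grad_U)
    samples.append(x)
samples = np.array(samples)
print(samples.mean(axis=0))   # ≈ [0, 0]
print(np.cov(samples.T))      # ≈ identity
```

(The usual momentum negation at the end of the trajectory is omitted here: the Gaussian kinetic energy is symmetric in p, so it does not affect the acceptance probability.)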

Physical Analogies

Neal’s Hockey Puck Analogy

In two dimensions, we can visualize the dynamics as that of a frictionless puck sliding over a surface of varying height. The state consists of the position of the puck, q, and the momentum of the puck, p. Potential energy U(q) is proportional to the surface height, and kinetic energy K(p) is proportional to the square of the momentum. On a level surface, the puck moves at a constant velocity. On a slope, the puck’s momentum decreases as it climbs (potential energy increases) until it stops, then slides back down (potential energy decreases and kinetic energy increases).

Betancourt’s Orbit Analogy

Exploring the typical set is like placing a satellite in a stable orbit around a planet. If the satellite starts at rest, it crashes into the planet. Adding the right amount of momentum allows the satellite to balance gravitational attraction, maintaining a stable orbit. Too little momentum causes a crash; too much causes the satellite to escape.


Questions

  1. Why do we find the typical set given that we start at an arbitrary point, and once we have found it, why do we stay in it?

    I am trying to understand this intuitively and connect it to the equations stated previously. This question really ties in with the two different physical analogies, and here is where my understanding breaks down.

    Take Betancourt’s orbit analogy: every trajectory stays on its own energy level, including trajectories outside the typical set. When we randomly sample the momentum, we move to another energy level. This sample is drawn independently, without regard to our position, so how can we assume that the new level will be in the typical set? Furthermore, when we start out, why do we find the typical set at all? What in the equations determines this?

    If we view it from Neal’s perspective, things feel a bit easier. If we start at a position of high potential energy (and thus low probability density under our target distribution), the momentum will be low at that point. Tying this together with Betancourt’s orbit, this means we will start falling toward the mode and thus toward higher probability density. When we are at high probability density, the momentum will be higher, so we will not get stuck.

    I am trying to tie this together with the equations. What disturbs me about the equations is that, as I understand it, we have the gradients of the target distribution, which point toward the different local modes. We then twist them so that they run perpendicular to that, around the modes rather than into them. Why would a random point in space then be pushed toward regions of higher probability density? Why do we fall into the typical set, and why exactly the typical set? I don’t see any p(x) \, dx terms in the equations.

  2. Why do we even need the canonical distribution, and why these energy levels? Why do we want to have a system that is energy-preserving?

    This is maybe the largest hurdle to get your head around, and I feel that it ties in with the previous question. The energy function feels extremely ad hoc. Why do we even define it? I sense it relates to the fact that if the kinetic energy goes up, the potential energy has to go down in some sense… but this is already built into the Hamiltonian system, so why do we need the canonical distribution on top of it?

  3. Where does the normalization constant in our target distribution disappear, and what does this mean intuitively?

    • Potential answer: when deriving Hamilton’s equations, it drops out because it does not depend on our parameters (the derivative of U w.r.t. \mathbf{x} of a term that does not depend on \mathbf{x} is zero).
    • What this means intuitively: when simulating the system we are traversing the actual posterior, not just the posterior up to a constant.
  4. The idea of the typical set and why Betancourt is so focused on it:

    • Potential answer: the typical set really only matters in very high dimensions. We can’t traverse the whole posterior space there, so we need to be smart about where we spend our time. What happens in high-dimensional spaces is that the volume outside the neighborhood of the mode grows faster than the density decays as we move away from the mode, but we can’t go too far out either. Somewhere between the mode and the tails lies the typical set, where p(x) \, dx is highest.

    • The question: how can we be sure that we actually traverse the typical set and not just circle close to the mode? What in the equations says that we will focus on the region where p(x) \, dx is highest? (See the numerical sketch after this list.)

  5. Why is Betancourt so focused on expectations?

    • Potential answer: I thought that what we really want in Bayesian inference are posteriors. Estimating a posterior involves estimating an expectation (the evidence/normalization constant). Furthermore, expectations cover more applications: for example, we can recover the median of a posterior from expectations, and we can reformulate almost anything of interest as an expectation.
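
Regarding question 4, a short numerical sketch (standard Gaussian; names are my own) shows where the mass p(x) \, dx actually sits as a function of distance from the mode:

```python
import numpy as np

# For a d-dimensional standard Gaussian, the mass at radius r is
# p(r) ∝ r^(d-1) exp(-r^2 / 2): density shrinks with r, volume grows as r^(d-1).
r = np.linspace(1e-3, 40, 4000)
for d in [1, 10, 100, 1000]:
    log_mass = (d - 1) * np.log(r) - 0.5 * r**2
    print(f"d={d:4d}  mass peaks at r={r[np.argmax(log_mass)]:5.2f}  (sqrt(d-1)={np.sqrt(d - 1):5.2f})")
```

The peak sits at \sqrt{d-1}, away from the mode for any d > 1: this is the “somewhere between the mode and the tails” of the potential answer above.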

It’s important to remember that all of this is embedded in a generic Metropolis-Hastings algorithm with accept/reject steps. So, for example:

  • You’ll usually move fairly quickly towards the typical set even if you start in a low-density region, because moves that take you towards even lower-density regions tend to be rejected. Any mixing MCMC kernel moves you towards the typical set, at least eventually.
  • Perhaps the most essential point: because the MH acceptance probability involves a difference of energies, and Hamilton’s equations guarantee that energy is exactly preserved (any non-preservation comes from solving Hamilton’s equations only approximately rather than exactly), acceptance rates are high as long as the symplectic integrator is doing a good job. And they stay high even though we make pretty large moves by skating around energy level sets. So Hamilton’s equations give us a mechanism for making large moves that are mostly accepted, and the MH correction makes sure that we sample from the correct distribution asymptotically.
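
To illustrate that last point, here is a small comparison (my own sketch, 1D standard Gaussian) of the energy error of the symplectic leapfrog against a naive explicit Euler integrator: Euler’s energy error grows without bound, while leapfrog’s stays small, which is what keeps MH acceptance rates high.

```python
import numpy as np

# 1D standard Gaussian: U(x) = x^2/2, grad_U(x) = x, H = x^2/2 + p^2/2.
def energy(x, p):
    return 0.5 * x**2 + 0.5 * p**2

eps, n = 0.1, 2000
x_e, p_e = 1.0, 0.0   # explicit Euler state
x_l, p_l = 1.0, 0.0   # leapfrog state
H0 = energy(1.0, 0.0)
for _ in range(n):
    # explicit Euler: not symplectic, energy error accumulates
    x_e, p_e = x_e + eps * p_e, p_e - eps * x_e
    # leapfrog: symplectic, energy error oscillates but stays bounded
    p_l -= 0.5 * eps * x_l
    x_l += eps * p_l
    p_l -= 0.5 * eps * x_l
print(f"Euler    |H - H0| = {abs(energy(x_e, p_e) - H0):.3e}")  # large drift
print(f"leapfrog |H - H0| = {abs(energy(x_l, p_l) - H0):.3e}")  # stays tiny
```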