Beyond Smooth Densities: Mastering Composite Log-Concave Sampling

If you've ever tried to sample from a high-dimensional posterior and watched your standard Langevin sampler slow to a crawl because of a non-smooth prior like an L1 penalty, you're facing a fundamental limit of gradient-based MCMC. Most common algorithms assume the entire log-density is differentiable. However, in modern machine learning, we often encounter "composite" structures where a smooth likelihood is paired with a non-smooth regularizer. Evaluating a gradient at a point where the function is not differentiable leads to instability or, worse, a failure to converge within a reasonable timeframe.

The Challenge of Composite Log-Concave Landscapes

When we deal with a target density proportional to $e^{-f-g}$, where $f$ is smooth but $g$ is not, traditional samplers treat the whole exponent as a single entity. This is often a mistake. The function $f$ usually represents the data's general trend, while $g$ enforces structural constraints like sparsity or boundary conditions. Developers often struggle here: should you smooth out $g$ and lose the sharp features of your model, or stick with a slow, jagged sampler?

Proximal gradient algorithms offer a third way. By borrowing the "proximal operator" concept from the optimization world, we can handle $f$ and $g$ separately. Instead of trying to find a single gradient for a function that doesn't have one everywhere, we use the gradient for the smooth part and a specialized "proximal step" for the non-smooth part. It is a divide-and-conquer strategy for probability spaces.
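To make the split concrete, here is a minimal sketch of the optimization-side idea that proximal samplers borrow: a gradient step on the smooth part followed by a proximal step on the non-smooth part. It assumes $g(x) = \lambda \|x\|_1$, whose proximal operator is the well-known soft-thresholding map. The names `prox_l1`, `proximal_gradient_step`, `lam`, and `eta` are illustrative choices, not taken from any particular library.

```python
import math

def prox_l1(v, lam, eta):
    """Proximal operator of g(x) = lam * ||x||_1 with step size eta.

    This is soft-thresholding: shrink each coordinate toward zero by
    lam * eta and clip at zero.
    """
    t = lam * eta
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def proximal_gradient_step(x, grad_f, lam, eta):
    """One proximal gradient step for minimizing f(x) + lam * ||x||_1.

    grad_f is a user-supplied callable returning the gradient of the
    smooth part f at x (as a list of floats).
    """
    # Gradient step on the smooth part only.
    forward = [xi - eta * gi for xi, gi in zip(x, grad_f(x))]
    # Proximal step on the non-smooth part only.
    return prox_l1(forward, lam, eta)
```

Note that the non-smooth term is never differentiated: it only ever appears through its proximal map, which is why the kink at zero causes no trouble.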

The Restricted Gaussian Oracle (RGO) as an Engine

The secret sauce in this approach is the Restricted Gaussian Oracle (RGO). In optimization, a proximal operator finds the point that trades off staying close to the current position against minimizing the non-smooth function $g$. In sampling, the RGO does something more sophisticated: instead of returning a single point, it draws a random sample from a distribution whose density is a Gaussian centered at the current point, reweighted by $e^{-g}$.
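Written out, the parallel between the two objects is easy to see. The step size symbol $\eta > 0$ and the variable names $x$, $y$ are notational assumptions made here for illustration:

$$
\operatorname{prox}_{\eta g}(y) = \arg\min_{x} \left\{ g(x) + \frac{1}{2\eta}\,\|x - y\|^2 \right\}
$$

$$
\text{RGO}(y):\quad \text{sample } x \text{ with density } \propto \exp\!\left( -g(x) - \frac{1}{2\eta}\,\|x - y\|^2 \right)
$$

The proximal operator returns the mode of the density that the RGO samples from.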

This isn't just a random jump. It's a mathematically rigorous way to respect the constraints of $g$ while exploring the space defined by $f$. From my perspective, the beauty of this method lies in how it decouples the complexity. As long as you have an efficient way to implement the RGO, which is the case for many common penalties such as L1 or box constraints, the overall sampling efficiency becomes independent of the "sharpness" of $g$.
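As one example of an RGO that is cheap to implement, consider a box constraint, where $g$ is zero inside $[lo, hi]^d$ and infinite outside. The RGO density then reduces to an independent truncated Gaussian per coordinate. The sketch below samples it by inverse-CDF using only the Python standard library. The function name `rgo_box` and its parameters are hypothetical, and the inverse-CDF approach is assumed adequate only when the box is not extremely far in the Gaussian tail.

```python
import math
import random
from statistics import NormalDist

def rgo_box(y, eta, lo, hi, rng):
    """Sample x with density proportional to exp(-||x - y||^2 / (2*eta))
    restricted to the box [lo, hi]^d (g is the indicator of the box).

    rng is a random.Random instance; eta > 0 is the step size.
    """
    sigma = math.sqrt(eta)
    x = []
    for yi in y:
        dist = NormalDist(mu=yi, sigma=sigma)
        # Probability mass of the untruncated Gaussian below each bound.
        p_lo, p_hi = dist.cdf(lo), dist.cdf(hi)
        u = rng.uniform(p_lo, p_hi)
        # Keep u strictly inside (0, 1) so inv_cdf is defined; this
        # loses accuracy if the box sits very deep in the tail.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        xi = dist.inv_cdf(u)
        # Guard against round-off pushing the draw just outside the box.
        x.append(min(max(xi, lo), hi))
    return x
```

A call such as `rgo_box([5.0, -5.0, 0.2], 0.5, -1.0, 1.0, random.Random(0))` returns a point inside the box even when the Gaussian center lies well outside it.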

Performance Trade-offs and Practical Realities

In theory, proximal sampling can achieve convergence rates that are significantly better than naive random-walk MCMC in high dimensions. While a standard random-walk Metropolis-Hastings sampler might require on the order of $O(d^2)$ steps to explore a $d$-dimensional space effectively, proximal methods leverage the log-concave structure to cut through the complexity. However, there is no such thing as a free lunch.

The primary overhead is the computational cost of the RGO itself. If $g$ is a simple function, the RGO is nearly instantaneous. But if $g$ is complex, you might find yourself running a nested sampling loop, which can negate the speed gains. I've found that this algorithm is most effective when the non-smooth component has a well-known structure. If you're dealing with arbitrary, black-box non-smoothness, the implementation complexity might outweigh the theoretical benefits.
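To illustrate the "simple $g$" case, here is a sketch of an exact one-dimensional RGO for the L1 penalty $g(x) = \lambda |x|$. Completing the square on each side of zero shows that the RGO density is a two-component mixture of truncated Gaussians, so no nested sampling loop is needed. The function name `rgo_l1_1d` and its parameters are hypothetical, and the code assumes moderate values of `lam * y` so that the exponentials do not overflow.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf / inv_cdf

def _clamp(u):
    # Keep u strictly inside (0, 1) so inv_cdf is defined.
    return min(max(u, 1e-12), 1.0 - 1e-12)

def rgo_l1_1d(y, lam, eta, rng):
    """Sample x with density proportional to
    exp(-lam * |x| - (x - y)**2 / (2 * eta)).

    rng is a random.Random instance; lam >= 0 and eta > 0.
    """
    sigma = math.sqrt(eta)
    mu_pos = y - eta * lam  # Gaussian mean on the branch x >= 0
    mu_neg = y + eta * lam  # Gaussian mean on the branch x < 0
    # Unnormalized mass of each branch (shared factors cancelled).
    w_pos = math.exp(-lam * y) * _STD.cdf(mu_pos / sigma)
    w_neg = math.exp(lam * y) * _STD.cdf(-mu_neg / sigma)
    if rng.random() * (w_pos + w_neg) < w_pos:
        # N(mu_pos, eta) truncated to [0, inf), by inverse CDF.
        p0 = _STD.cdf(-mu_pos / sigma)
        u = _clamp(rng.uniform(p0, 1.0))
        return max(0.0, mu_pos + sigma * _STD.inv_cdf(u))
    # N(mu_neg, eta) truncated to (-inf, 0), by inverse CDF.
    p0 = _STD.cdf(-mu_neg / sigma)
    u = _clamp(rng.uniform(0.0, p0))
    return min(0.0, mu_neg + sigma * _STD.inv_cdf(u))
```

Because an L1 penalty separates across coordinates, a multi-dimensional version under the same assumptions would simply apply this draw to each coordinate independently.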

Implementation Patterns for the Modern Developer

When putting this into production, the choice of step size is your most critical lever. A step size that is too aggressive will lead to rejection or instability during the RGO phase, while one that is too conservative will make your sampler feel like it's stuck in molasses. A good rule of thumb is to scale your step size inversely to the smoothness constant $L$ of the $f$ component, that is, the Lipschitz constant of its gradient $\nabla f$, so a step size on the order of $1/L$ is a sensible starting point.
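As a sketch of that rule of thumb, suppose the smooth part is a least-squares term $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, which is an assumption made here for illustration. Its gradient is Lipschitz with constant $L = \lambda_{\max}(A^\top A)$, which power iteration can estimate. The helper names and the `safety` factor below are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def lipschitz_least_squares(A, iters=200):
    """Estimate L = lambda_max(A^T A) by power iteration.

    Assumes the all-ones start vector is not orthogonal to the top
    eigenvector; a random start would be more robust.
    """
    n = len(A[0])
    At = [list(col) for col in zip(*A)]
    v = [1.0 / math.sqrt(n)] * n
    L = 0.0
    for _ in range(iters):
        w = matvec(At, matvec(A, v))          # w = A^T A v
        L = math.sqrt(sum(wi * wi for wi in w))
        if L == 0.0:
            return 0.0
        v = [wi / L for wi in w]              # renormalize
    return L

def step_size(A, safety=1.0):
    """Step size eta = safety / L, with safety <= 1 for a more cautious step."""
    return safety / lipschitz_least_squares(A)
```

For `A = [[2.0, 0.0], [0.0, 1.0]]` the largest eigenvalue of $A^\top A$ is 4, so the suggested step size is about 0.25.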

Another pattern to consider is the use of a "correction" step. Because the RGO might only be an approximation in some implementations, adding a Metropolis-Hastings acceptance/rejection step at the end of each iteration can ensure that you are sampling from the exact target distribution. It adds a bit of overhead but provides the mathematical guarantee that your results aren't biased by the discretization of the proximal step.
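The correction pattern can be sketched as follows. Note that this example does not use an RGO: it swaps in a simpler Metropolis-adjusted proposal in the spirit of proximal MALA, where the proposal mean is a proximal gradient step and a standard Metropolis-Hastings test accepts or rejects it. It assumes $g(x) = \lambda \|x\|_1$, and the function names and parameters are hypothetical.

```python
import math
import random

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1 applied coordinate-wise."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def metropolis_proximal_step(x, f, grad_f, lam, eta, rng):
    """One Metropolis-adjusted step targeting exp(-f(x) - lam * ||x||_1).

    Proposal: Gaussian with variance 2 * eta centered at the proximal
    gradient point. Returns (new_state, accepted_flag).
    """
    def potential(z):
        return f(z) + lam * sum(abs(zi) for zi in z)

    def proposal_mean(z):
        fwd = [zi - eta * gi for zi, gi in zip(z, grad_f(z))]
        return soft_threshold(fwd, lam * eta)

    def log_q(to, frm):
        # Log proposal density of moving frm -> to, up to a constant.
        m = proposal_mean(frm)
        return -sum((a - b) ** 2 for a, b in zip(to, m)) / (4.0 * eta)

    scale = math.sqrt(2.0 * eta)
    y = [mi + scale * rng.gauss(0.0, 1.0) for mi in proposal_mean(x)]
    # Metropolis-Hastings log acceptance ratio.
    log_alpha = potential(x) - potential(y) + log_q(x, y) - log_q(y, x)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return y, True
    return x, False
```

Because the acceptance test uses the exact target and the exact proposal density, the chain leaves the target invariant regardless of how crude the proposal mean is; a poor proposal only shows up as a low acceptance rate.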

Ultimately, the shift from pure gradient-based methods to proximal methods represents a more nuanced understanding of probability landscapes. Stop trying to smooth out the world; instead, choose an algorithm that understands the edges. The most robust models are often those that respect the inherent non-smoothness of the data rather than trying to hide it.

Reference: arXiv CS.LG (Machine Learning)
