Aligning Latent Spaces with Flow Priors

1The University of Hong Kong    2ARC Lab, Tencent PCG

Abstract

This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.

Method Overview

Teaser

(a) Conventional alignment works only with known priors (e.g., Gaussian or categorical) using KL or cross-entropy losses. (b) Our proposed method can align the latent distribution to an arbitrary target distribution captured by a pre-trained flow model.

Our approach addresses a fundamental challenge in representation learning: how to align learnable latent spaces with arbitrary target distributions. We propose a two-stage framework that leverages the power of flow-based models as flexible distributional priors.

  • Arbitrary Target Distributions: align to any distribution, even those defined only implicitly by samples
  • Computationally Efficient: a single forward pass through the pre-trained flow model, with no ODE solving or likelihood evaluation
  • Theoretical Foundation: a proven connection to a variational lower bound on the log-likelihood

Method

Two-Stage Pipeline

Let $\mathbf{y} \in \mathbb{R}^{d_1}$ denote samples from a learnable latent space (e.g., encoder outputs), and $\mathbf{x} \in \mathbb{R}^{d_2}$ represent samples from a target feature space with distribution $p_{\mathrm{data}}(\mathbf{x})$. Our goal is to train the latent model such that $p_\phi(\mathbf{y})$ aligns with $p_{\mathrm{data}}(\mathbf{x})$.

Step 1: Flow Prior Estimation

Train a flow model $\mathbf{v}_\theta: \mathbb{R}^{d_1} \times [0,1] \rightarrow \mathbb{R}^{d_1}$ on projected target features using the standard flow matching objective. After training, we freeze $\theta$ to obtain a fixed distributional prior.
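To make Step 1 concrete, here is a minimal NumPy sketch of the standard flow matching objective, not the paper's implementation: instead of a neural network $\mathbf{v}_\theta$, it uses a point-mass target, for which the exact conditional velocity field has the closed form $(\boldsymbol{\mu} - \mathbf{x})/(1-t)$ and drives the objective to zero, while an uninformed field does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v, x1, d, n=10000):
    """Monte Carlo estimate of E_{t, x0} || v((1-t) x0 + t x1, t) - (x1 - x0) ||^2."""
    t = rng.uniform(0.0, 0.99, size=(n, 1))  # avoid t = 1, where the analytic field below blows up
    x0 = rng.standard_normal((n, d))         # p_init = N(0, I)
    xt = (1 - t) * x0 + t * x1               # straight interpolation path
    target = x1 - x0                         # straight-path velocity
    return np.mean(np.sum((v(xt, t) - target) ** 2, axis=1))

mu = np.array([2.0, -1.0])                   # toy point-mass "target distribution"

# With x1 fixed at mu, the exact conditional velocity is (mu - x) / (1 - t):
v_exact = lambda x, t: (mu - x) / (1 - t)
v_wrong = lambda x, t: np.zeros_like(x)

print(flow_matching_loss(v_exact, mu, d=2))  # ~0
print(flow_matching_loss(v_wrong, mu, d=2))  # >> 0
```

A real target distribution has no such closed form, which is why the paper fits $\mathbf{v}_\theta$ by regression on samples and then freezes it.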

Step 2: Latent Space Regularization

Use the fixed flow model to regularize learnable latents $\mathbf{y}$ through our proposed alignment loss, encouraging consistency with the learned flow dynamics.

Alignment Loss

The core of our approach is the alignment loss that reformulates the flow matching objective to treat latents as targets:

$$\mathcal{L}_{\text{align}}(\mathbf{y}; \theta) = \mathbb{E}_{t \sim \mathcal{U}[0, 1], \mathbf{x}_0 \sim p_{\mathrm{init}}} \left[ \| \mathbf{v}_\theta((1-t)\mathbf{x}_0 + t \mathbf{y}, t) - (\mathbf{y} - \mathbf{x}_0) \|^2 \right]$$

This loss measures how well the pre-trained velocity field $\mathbf{v}_\theta$ can predict the velocity along straight paths from noise $\mathbf{x}_0$ to latents $\mathbf{y}$. Minimizing this loss encourages $\mathbf{y}$ to lie in regions that are consistent with the learned flow dynamics.
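The loss above can be sketched numerically. The snippet below is an illustration, not the paper's code: as a stand-in for a trained $\mathbf{v}_\theta$, it uses the exact marginal velocity field for a single-Gaussian target $\mathcal{N}(\boldsymbol{\mu}, \mathbf{I})$ under linear interpolation (a known closed form), keeps that field frozen, and shows that $\mathcal{L}_{\text{align}}$ is lower for latents in high-density regions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])  # mean of the Gaussian "target distribution"

def v_prior(x, t):
    """Exact marginal velocity field for p_init = N(0, I), p_data = N(mu, I)."""
    s2 = (1 - t) ** 2 + t ** 2               # per-time variance of x_t
    return mu + (2 * t - 1) * (x - t * mu) / s2

def align_loss(y, n=20000):
    """L_align(y) = E_{t, x0} || v((1-t) x0 + t y, t) - (y - x0) ||^2, v frozen."""
    t = rng.uniform(0.0, 1.0, size=(n, 1))
    x0 = rng.standard_normal((n, 2))
    xt = (1 - t) * x0 + t * y
    return np.mean(np.sum((v_prior(xt, t) - (y - x0)) ** 2, axis=1))

print(align_loss(mu))        # low: y at the mode of p_data
print(align_loss(mu + 5.0))  # higher: y far from the target distribution
```

Note that each Monte Carlo sample requires only one evaluation of the velocity field: no ODE integration, no divergence computation.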

Theoretical Foundation

We establish a formal connection between our alignment loss and maximum likelihood estimation under the flow-defined distribution:

$$\log p_1^{\mathbf{v}_\theta}(\mathbf{y}) \geq C(\mathbf{y}) - \mathcal{L}_{\text{align}}(\mathbf{y}; \theta)$$

where $C(\mathbf{y})$ represents the expected negative divergence of the velocity field along variational paths. This shows that minimizing our alignment loss serves as a computationally tractable proxy for maximizing the log-likelihood under the target distribution.

Toy Example

Validation on Controlled Setting

To validate our theoretical insights, we conduct experiments on a mixture of Gaussians target distribution. We train a flow model on samples from this mixture and demonstrate that our alignment loss serves as an effective proxy for the true negative log-likelihood.
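A small version of this experiment can be reproduced in a few lines. The sketch below makes one simplifying assumption relative to the paper: rather than training a flow model on mixture samples, it plugs in the exact marginal velocity field of a two-component Gaussian mixture (available in closed form for Gaussian targets under linear interpolation) as the frozen prior, then checks that $\mathcal{L}_{\text{align}}$ is lower at a mixture mode than at an off-distribution point.

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([[-3.0, 0.0], [3.0, 0.0]])  # two-mode mixture, unit covariances, equal weights

def v_mixture(x, t):
    """Exact marginal velocity for p_init = N(0, I) and an equal-weight Gaussian
    mixture target: a posterior-weighted average of per-component fields."""
    s2 = (1 - t) ** 2 + t ** 2                               # (n, 1)
    diff = x[:, None, :] - t[:, None, :] * mus[None, :, :]   # (n, K, d)
    logw = -np.sum(diff ** 2, axis=2) / (2 * s2)             # (n, K) component log-weights
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    u_k = mus[None, :, :] + (2 * t - 1)[:, None, :] * diff / s2[:, None, :]
    return np.sum(w[:, :, None] * u_k, axis=1)               # (n, d)

def align_loss(y, n=20000):
    t = rng.uniform(0.0, 1.0, size=(n, 1))
    x0 = rng.standard_normal((n, 2))
    xt = (1 - t) * x0 + t * y
    return np.mean(np.sum((v_mixture(xt, t) - (y - x0)) ** 2, axis=1))

print(align_loss(mus[0]))                  # low: y at a mixture mode
print(align_loss(np.array([10.0, 10.0])))  # high: y off the data distribution
```

Evaluating `align_loss` on a grid of `y` values reproduces the heatmap in panel (b): the loss landscape mirrors the mixture's NLL landscape up to an additive offset.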

[Figure panels: (a) NLL heatmap, (b) alignment loss heatmap, (c) loss curves]

Illustration with a Mixture of Gaussians distribution. (a) Aligned latent variables $\mathbf{y}$ (red triangles) concentrate in low negative log-likelihood (NLL) regions of $p_{\text{data}}$ (blue dots; heatmap shows $-\log p_{\text{data}}$). (b) The alignment loss $\mathcal{L}_{\text{align}}$ heatmap mirrors the NLL landscape of $p_{\text{data}}$, with $p_{\text{data}}$ samples falling in low-$\mathcal{L}_{\text{align}}$ areas. (c) $\mathcal{L}_{\text{align}}$ (blue solid) and $-\log p_{\text{data}}(\mathbf{y})$ (red dashed) decline together during training, showing that $\mathcal{L}_{\text{align}}$ serves as a proxy for maximizing the log-likelihood of $\mathbf{y}$ under $p_{\text{data}}$.

Training Progress Across Diverse Distributions

To demonstrate the generality of our approach, we visualize the evolution of optimized variables $\mathbf{y}$ during training across various toy target distributions. Each animation shows how minimizing $\mathcal{L}_{\text{align}}$ guides the latent variables (red triangles) to converge towards high-density regions of different target distributions.

[Training-progress animations: optimized latents $\mathbf{y}$ (red triangles) converging on each target]
  • Gaussian Mixture
  • Gaussian Grid
  • Two Moons
  • Rings
  • Spiral
  • Swiss Roll

ImageNet Results

Alignment Loss Validation

We validate our approach through comprehensive experiments on ImageNet-1K, demonstrating effectiveness across diverse target distributions including VAE (low-level), DINO (semantic), VQ (discrete), and Qwen (textual). Our method consistently achieves superior alignment while maintaining computational efficiency.

[Alignment curves for each target distribution]
  • VAE (Low-level)
  • DinoV2 (Semantic)
  • VQ (Discrete)
  • Qwen (Textual)

Aligning autoencoders on ImageNet-1K with different target distributions. The alignment loss $\mathcal{L}_{\text{align}}$ (blue solid) and the $k$-NN distance $\log r_k(\mathbf{y})$ (red dashed) remain proportional throughout training, confirming that $\mathcal{L}_{\text{align}}$ serves as a proxy for the NLL of the latents under $p_{\text{data}}$.
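The $k$-NN distance proxy can be sketched as follows; the value of $k$, the Euclidean metric, and the Gaussian stand-in data here are assumptions for illustration, not the paper's exact settings. By $k$-NN density estimation, $-\log p(\mathbf{y}) \approx d \log r_k(\mathbf{y}) + \text{const}$, so a smaller $\log r_k$ indicates higher density under $p_{\text{data}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_log_radius(y, data, k=5):
    """log r_k(y): log distance from y to its k-th nearest neighbor in `data`.
    Smaller values indicate that y sits in a denser region of the data."""
    dists = np.linalg.norm(data - y, axis=1)
    return np.log(np.sort(dists)[k - 1])

data = rng.standard_normal((5000, 2))  # stand-in for target features
y_in = np.array([0.0, 0.0])            # in a high-density region
y_out = np.array([6.0, 6.0])           # far from the data

print(knn_log_radius(y_in, data), knn_log_radius(y_out, data))
```

For the real ImageNet features an approximate nearest-neighbor index would replace the brute-force distance computation, but the quantity tracked is the same.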

Large-Scale Image Generation

We evaluate our approach on ImageNet-1K using MAR-B with flow heads for conditional generation. To ensure a fair comparison, all models use identical training configurations; only the target distribution used for alignment differs.

| Autoencoder | rFID↓ | PSNR↑ | FID↓ (w/o CFG) | IS↑ (w/o CFG) | Pre.↑ (w/o CFG) | Rec.↑ (w/o CFG) | FID↓ (w/ CFG) | IS↑ (w/ CFG) | Pre.↑ (w/ CFG) | Rec.↑ (w/ CFG) |
|---|---|---|---|---|---|---|---|---|---|---|
| AE | 1.13 | 20.20 | 15.08 | 86.37 | 0.60 | 0.59 | 5.26 | 237.60 | 0.56 | 0.65 |
| KL | 1.65 | 22.59 | 12.94 | 91.86 | 0.60 | 0.58 | 5.29 | 200.85 | 0.57 | 0.65 |
| SoftVQ | 0.61 | 23.00 | 13.30 | 93.40 | 0.60 | 0.57 | 6.09 | 198.53 | 0.58 | 0.61 |
| Low-level (VAE) | 1.22 | 22.31 | 12.04 | 98.66 | 0.56 | 0.57 | 5.02 | 240.03 | 0.56 | 0.62 |
| Semantic (Dino) | 1.26 | 23.07 | 11.47 | 101.74 | 0.59 | 0.59 | 4.87 | 250.38 | 0.54 | 0.67 |
| Discrete (VQ) | 2.99 | 22.32 | 24.63 | 48.17 | 0.55 | 0.53 | 10.04 | 119.64 | 0.47 | 0.65 |
| Textual (Qwen) | 0.85 | 23.12 | 11.89 | 102.23 | 0.55 | 0.57 | 6.56 | 262.89 | 0.49 | 0.69 |
  • Alignment vs. Reconstruction Trade-off: Structured latent spaces slightly reduce reconstruction quality (higher rFID) but significantly improve generation metrics (lower FID, higher IS)
  • Semantic Features Excel: DinoV2 semantic features achieve the best FID scores, demonstrating that meaningful feature representations enhance generation quality
  • Cross-modal Benefits: Text embeddings (Qwen) show competitive performance, suggesting that cross-modal alignment can transfer structural benefits to visual generation
  • Distribution Complexity Matters: Low-dimensional discrete features (VQ) underperform, while richer representations consistently improve results

Citation

@misc{li2025aligning,
  title={Aligning Latent Spaces with Flow Priors},
  author={Li, Yizhuo and Ge, Yuying and Ge, Yixiao and Shan, Ying and Luo, Ping},
  year={2025},
  url={https://arxiv.org/abs/2506.05240}
}