A framework for trajectory compression on the probability simplex
Abstract. The sequential nature of autoregressive next-token prediction imposes a fundamental speed limit on Large Language Models. While continuous flow models offer a path to parallel generation, they traditionally demand expensive iterative integration. Flow Maps bypass this bottleneck by compressing generative trajectories into single-step mappings—theoretically enabling the generation of full text sequences from noise in a single forward pass. However, standard formulations rely on Euclidean regression losses that are geometrically ill-suited for discrete data. In this work, we resolve this conflict with Discrete Flow Maps, a framework that reconciles trajectory compression with the geometry of the probability simplex. We recast standard flow map training for the discrete domain, aligning the training dynamics with the discrete nature of language.
Consider a vocabulary of \(K\) tokens—for instance, three words: fox, dog, and cat.
Each token is represented as a one-hot vector \(e_i \in \mathbb{R}^K\), forming the vertices of the \((K{-}1)\)-dimensional probability simplex \(\Delta^{K-1}\).
Any probability distribution over the vocabulary is a point inside the simplex. The closer a point is to a vertex, the higher the probability assigned to that token.
The simplex is defined as:
\[\Delta^{K-1} = \{x \in \mathbb{R}^K : x \geq 0,\; \langle \mathbf{1}, x \rangle = 1\}\]
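As a quick sanity check, the membership test implied by this definition takes a few lines of numpy (using the three-token toy vocabulary from above):

```python
import numpy as np

K = 3  # toy vocabulary: fox, dog, cat
vertices = np.eye(K)  # one-hot vectors e_1, ..., e_K as rows

def on_simplex(x, tol=1e-8):
    """Check x >= 0 and <1, x> = 1, i.e. membership in Delta^{K-1}."""
    x = np.asarray(x, dtype=float)
    return bool((x >= -tol).all() and abs(x.sum() - 1.0) < tol)

assert on_simplex(vertices[0])            # a vertex is a valid distribution
assert on_simplex(np.full(K, 1 / K))      # so is the uniform distribution
assert not on_simplex([0.9, 0.2, -0.1])   # negative mass: outside the simplex
```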
To generate discrete data, we want an ODE
\[\dot x_s = b_s(x_s), \qquad x_0 \sim \rho_0,\]
that transports a continuous base distribution \(\rho_0\) in \(\mathbb{R}^K\) to the discrete target \(\rho_1\) supported on the simplex vertices. How do we learn the velocity field \(b_s\)?
We need training pairs that tell us where the flow should go at each time. The stochastic interpolant provides exactly this: given a base sample \(I_0 \sim \rho_0\) and a data token \(I_1 \in \{e_1, \dots, e_K\}\) drawn from \(\rho_1\), it defines a straight-line path between them:
\[I_s = (1 - s)\, I_0 + s\, I_1, \qquad s \in [0, 1].\]
Each base–data pair \((I_0, I_1)\) traces a straight line from the base distribution toward a simplex vertex. The interpolant \(I_s\) slides along this line as \(s\) increases. At \(s{=}0\) we recover the base; at \(s{=}1\) we arrive at a one-hot vertex. At any intermediate time the population of \(I_s\) points forms a cloud that smoothly evolves from continuous to discrete.
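A minimal numpy sketch of the interpolant; the Gaussian base is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

def interpolant(I0, I1, s):
    """The straight-line interpolant I_s = (1 - s) I_0 + s I_1."""
    return (1 - s) * I0 + s * I1

I0 = rng.standard_normal(K)       # base sample (here Gaussian, an assumption)
I1 = np.eye(K)[rng.integers(K)]   # data token: a uniformly chosen vertex

# endpoints: the base at s = 0, the one-hot vertex at s = 1
assert np.allclose(interpolant(I0, I1, 0.0), I0)
assert np.allclose(interpolant(I0, I1, 1.0), I1)

# the slope is constant along the line: I_1 - I_0
ds = 1e-6
slope = (interpolant(I0, I1, 0.5 + ds) - interpolant(I0, I1, 0.5)) / ds
assert np.allclose(slope, I1 - I0, atol=1e-5)
```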
The interpolant tells us the desired direction at every point. Each interpolant has slope \(\dot{I}_s = I_1 - I_0\). But many interpolants pass through any fixed point \(x_s\). The right velocity field averages over all of them:
\[b_s(x_s) = \mathbb{E}[\dot{I}_s \mid I_s = x_s].\]
The instantaneous denoiser is the conditional expectation of the clean data given the noisy observation:
\[\begin{aligned}\psi_{s,s}(x_s) &= \mathbb{E}[I_1 \mid I_s = x_s] \\ &= \sum_{i=1}^{K} \underbrace{\mathbb{P}(I_1 = e_i \mid I_s = x_s)}_{p_i}\, e_i.\end{aligned}\]
The velocity field \(b_s(x_s)\) points directly toward the instantaneous denoiser \(\psi_{s,s}(x_s)\).
Since the weights \(p_i\) are non-negative and sum to one, \(\psi_{s,s}\) is a convex combination of vertices and therefore always a valid probability distribution on the simplex. This geometric fact means we can learn the instantaneous denoiser with a cross-entropy loss:
\[\psi_{s,s} = \operatorname*{arg\,min}_{\hat \psi_{s,s}}\mathbb{E}\left[ -\sum_{k=1}^K I_1^{(k)} \log \hat{\psi}_{s,s}^{(k)}(I_s) \right].\]
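A sketch of this objective on a batch, with \(\hat\psi_{s,s}\) supplied directly as simplex points rather than produced by a network:

```python
import numpy as np

def diagonal_ce_loss(psi_hat, I1):
    """Cross-entropy -sum_k I_1^(k) log psi_hat^(k), averaged over the batch.

    psi_hat: (B, K) predicted simplex points hat{psi}_{s,s}(I_s)
    I1:      (B, K) one-hot data tokens
    """
    eps = 1e-12  # numerical floor to keep the log finite
    return float(-(I1 * np.log(psi_hat + eps)).sum(axis=1).mean())

# the loss is minimized when psi_hat puts all mass on the observed token
I1 = np.eye(3)[[0, 2]]               # batch of two one-hot targets
good = np.array([[0.9, 0.05, 0.05],
                 [0.1, 0.1, 0.8]])
bad = np.full((2, 3), 1 / 3)         # uniform guess
assert diagonal_ce_loss(good, I1) < diagonal_ce_loss(bad, I1)
```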
The generative ODE
\[\dot x_s = b_s(x_s), \qquad x_0 \sim \rho_0,\]
traces a trajectory from random noise \(x_0\) to a data point \(x_1\) (a simplex vertex). We mark two points along the trajectory: an earlier time \(x_s\) and a later time \(x_t\). The tangent \(b_s(x_s)\) at \(x_s\) projects onto the simplex at the instantaneous denoiser \(\psi_{s,s}\).
As we integrate along the trajectory from \(s\) to \(t\), we define the mean denoiser \(\psi_{s,t}\)—a time-weighted average of the instantaneous denoisers along the path:
\[\psi_{s,t}(x_s) = \int_s^t w(u)\,\underbrace{\mathbb{E}[I_1 \mid I_u = x_u]}_{\psi_{u,u}(x_u)}\, du\]
Since each \(\mathbb{E}[I_1 \mid I_u = x_u]\) is on the simplex and \(w(u)\) is a non-negative weight function that integrates to 1 over \([s,t]\), the mean denoiser is a convex combination of simplex points—so \(\psi_{s,t}(x_s) \in \Delta^{K-1}\) also lives on the simplex.
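In a numerical implementation the integral becomes a quadrature sum; a minimal sketch, with the instantaneous denoisers along the path given as inputs:

```python
import numpy as np

def mean_denoiser(psi_uu, w):
    """Discretized psi_{s,t}: a weighted average of instantaneous denoisers.

    psi_uu: (N, K) instantaneous denoisers psi_{u,u}(x_u) along the path
    w:      (N,) non-negative quadrature weights summing to 1
    """
    return w @ psi_uu

# three simplex points along a path, averaged with uniform weights
psi_uu = np.array([[0.8, 0.1, 0.1],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.6, 0.2]])
w = np.full(3, 1 / 3)
psi_st = mean_denoiser(psi_uu, w)
assert psi_st.min() >= 0 and np.isclose(psi_st.sum(), 1.0)  # still on the simplex
```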
Geometrically, just as the tangent at \(x_s\) projects to the instantaneous denoiser \(\psi_{s,s}\), the secant from \(x_s\) to \(x_t\) projects to the mean denoiser \(\psi_{s,t}\) on the simplex.
The flow map \(X_{s,t}\) jumps directly from time \(s\) to time \(t\) along the ODE trajectory \((x_u)_{u \in [0,1]}\). Since the true flow map is a convex combination of the current state and the mean denoiser:
\[X_{s,t}(x_s) = \tfrac{1-t}{1-s}\,x_s + \tfrac{t-s}{1-s}\,\psi_{s,t}(x_s),\]
we parameterize our map \(\hat X_{s,t}\) in terms of a neural network \(\hat \psi_{s,t} \in \Delta^{K-1}\) in the same way.
At \(t = 1\), the map reduces to the mean denoiser itself: \(X_{s,1}(x_s) = \psi_{s,1}(x_s)\).
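The parameterization and its two boundary cases (identity at \(t = s\), mean denoiser at \(t = 1\)) are easy to check numerically; a sketch:

```python
import numpy as np

def flow_map(x_s, s, t, psi_hat_st):
    """X_{s,t}(x_s) as the stated convex combination of x_s and hat{psi}_{s,t}."""
    return (1 - t) / (1 - s) * x_s + (t - s) / (1 - s) * psi_hat_st

x_s = np.array([0.4, 0.3, -0.1])   # current state at s = 0.2
psi = np.array([0.7, 0.2, 0.1])    # network output on the simplex

# at t = s the map is the identity; at t = 1 it is the mean denoiser
assert np.allclose(flow_map(x_s, 0.2, 0.2, psi), x_s)
assert np.allclose(flow_map(x_s, 0.2, 1.0, psi), psi)
```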
As \(x_t\) evolves along the trajectory, the mean denoiser \(\psi_{s,t}(x_s)\) tracks it like a shadow cast onto the simplex—always inside the triangle.
The flow map need not lie on the simplex, but the mean denoiser is always a valid probability distribution—so we can train it with KL divergence losses, respecting the geometry of the simplex. We next describe four self-consistency conditions that provide training targets for \(\psi_{s,t}\).
The flow map must satisfy the semi-group property: jumping from \(s\) to \(t\) must equal jumping from \(s\) to \(u\), then from \(u\) to \(t\).
This is equivalent to requiring the following convex decomposition of the mean denoiser:
\[\psi_{s,t}(x_s) = \alpha_{s,u,t}\,\psi_{s,u}(x_s) + (1-\alpha_{s,u,t})\,\psi_{u,t}(X_{s,u}(x_s)),\] where \(\alpha_{s,u,t} \in [0,1]\) is a time-dependent weight.
Since both sides of the above equation are probability distributions (the right-hand side is a convex combination of points on the simplex), we can enforce this identity via KL divergence rather than L\(^2\) regression:
\[\mathcal{L}_{\text{PSD}} = \mathbb{E}\Big[D_{\text{KL}}\big(\text{sg}[\alpha\,\hat\psi_{s,u} + (1-\alpha)\,\hat\psi_{u,t}(\hat X_{s,u})]\;\|\;\hat\psi_{s,t}\big)\Big].\]
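A numpy sketch of this loss; in an autodiff framework the target would be wrapped in stop-gradient, while here every array is already a constant:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Row-wise D_KL(p || q) for batches of simplex points."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

def psd_loss(psi_su, psi_ut, psi_st, alpha):
    """L_PSD sketch: KL from the two-jump target to the one-jump denoiser.

    psi_su: hat{psi}_{s,u}(x_s);  psi_ut: hat{psi}_{u,t}(hat{X}_{s,u}(x_s));
    psi_st: hat{psi}_{s,t}(x_s);  alpha is the weight alpha_{s,u,t} in [0, 1].
    """
    target = alpha * psi_su + (1 - alpha) * psi_ut  # convex combo: on the simplex
    return float(kl(target, psi_st).mean())

psi_su = np.array([[0.6, 0.3, 0.1]])
psi_ut = np.array([[0.2, 0.5, 0.3]])
alpha = 0.4
target = alpha * psi_su + (1 - alpha) * psi_ut
assert psd_loss(psi_su, psi_ut, target, alpha) == 0.0  # consistent map: zero loss
assert psd_loss(psi_su, psi_ut, psi_su, alpha) > 0.0   # inconsistent map: penalized
```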
The Lagrangian perspective requires that the flow endpoint velocity \(\partial_t X_{s,t}(x_s)\) matches the instantaneous drift \(b_t(x_t)\). In terms of the mean denoiser, this is equivalent to:
\[\psi_{s,t}(x_s) = \psi_{t,t}(x_t) - C_{s,t}\,\tfrac{\partial}{\partial t}\psi_{s,t}(x_s)\]
This means the tangent to the curve \(\psi_{s,t}(x_s)\) is always parallel to the direction \(\psi_{t,t}(x_t) - \psi_{s,t}(x_s)\).
The right-hand side of the above equation need not be a probability distribution when parametrized by arbitrary neural networks. To obtain a valid target distribution, we derive an equivalent condition in logit space:
\[\psi_{s,t} = \text{Softmax}\!\big(z_{t,t}(x_t) - \log(\mathbf{1} + C_{s,t}(\dot z_{s,t} - \bar{\dot z}_{s,t}\mathbf{1}))\big).\]
This logit-space formulation always yields a valid probability distribution, enabling KL divergence training.
The Eulerian perspective enforces that the flow map is invariant to the source time: \(\partial_s X_{s,t}(x_s) = 0\). In terms of the mean denoiser, this is equivalent to:
\[\frac{\partial}{\partial s}(\psi_{s,t}(x_s)) = \kappa_{s,t}\big(\psi_{s,t}(x_s) - \psi_{s,s}(x_s)\big),\]
where \(\kappa_{s,t}\) is a time-dependent scalar.
Again, we derive an equivalent logit-space condition:
\[\psi_{s,t} = \text{Softmax} \big(z_{s,s}(x) - \log(\mathbf{1} - \kappa_{s,t}^{-1}(D_s z_{s,t} - \overline{D}_s z_{s,t}\,\mathbf{1}))\big),\]
which always yields a valid probability distribution for KL training.
Taking the derivative of the semi-group condition with respect to the intermediate time \(u\) yields a fourth identity, which we call dPSD:
\[\frac{d}{du}\big[\alpha_{s,u,t}\,\psi_{s,u}(x) + (1-\alpha_{s,u,t})\,\psi_{u,t}(X_{s,u}(x))\big] = 0.\]
Setting \(u = s\) recovers ESD; setting \(u = t\) recovers LSD. This unifies the previous differential consistency rules as special cases of one differential identity.
All four consistency losses—PSD, LSD, ESD, and dPSD—are exact objectives derived directly from the geometric definitions of the flow, with no approximations.
Minimizing any of these losses together with the diagonal cross-entropy loss ensures that the learned flow map \(\hat X_{s,t}\) matches the true flow map \(X_{s,t}\) and therefore generates the correct data distribution at \(t=1\).
Unlike autoregressive models that generate one token at a time, discrete flow maps operate on all \(L\) token positions simultaneously.
Each position has its own simplex. The flow map transports noise to data across all positions in parallel, enabling generation of an entire sequence in a single forward pass.
The flows from Gaussian noise converge to simplex vertices—each vertex representing a token in the vocabulary. The full sequence emerges simultaneously rather than sequentially.
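A sketch of one-step parallel generation, with a row-wise softmax standing in for a trained network \(\hat\psi_{0,1}\):

```python
import numpy as np

rng = np.random.default_rng(1)
K, L = 3, 5   # vocabulary size and sequence length

def psi_hat_01(x):
    """Stand-in for a trained network: maps R^{L x K} to L simplex points."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)   # softmax rows: on the simplex

def one_step_generate(x0, denoiser):
    """Jump from s = 0 to t = 1 in one evaluation, X_{0,1}(x_0) = psi_{0,1}(x_0),
    then decode each position by its nearest vertex (argmax)."""
    return denoiser(x0).argmax(axis=-1)

x0 = rng.standard_normal((L, K))     # independent noise per position
tokens = one_step_generate(x0, psi_hat_01)
assert tokens.shape == (L,)          # one token per position, in parallel
```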
A pre-trained discrete flow map may not satisfy downstream objectives out of the box. For instance, a base model might produce an incorrect arithmetic result.
The base (green) dynamics land on tokens that form a plausible but wrong answer.
Fine-tuning steers the flow map dynamics toward a reward signal. The adapted (red) trajectories are nudged so that the landing tokens produce the correct answer.
Because each position evolves on its own simplex, steering preserves the parallel structure: all positions are corrected simultaneously.
We validate that respecting the geometry of the simplex allows us to surpass previous state-of-the-art results in discrete flow modeling. Results coming soon.