A framework for trajectory compression on the probability simplex
Abstract. The sequential nature of autoregressive next-token prediction imposes a fundamental speed limit on Large Language Models. While continuous flow models offer a path to parallel generation, they traditionally demand expensive iterative integration. Flow Maps bypass this bottleneck by compressing generative trajectories into single-step mappings—theoretically enabling the generation of full text sequences from noise in a single forward pass. However, standard formulations rely on Euclidean regression losses that are geometrically ill-suited for discrete data. In this work, we resolve this conflict with Discrete Flow Maps, a framework that reconciles trajectory compression with the geometry of the probability simplex. We recast standard flow map training for the discrete domain, aligning the training dynamics with the discrete nature of language.
Consider a vocabulary of \(K\) tokens—for instance, three words: fox, dog, and cat.
Each token is represented as a one-hot vector \(e_i \in \mathbb{R}^K\), forming the vertices of the \((K{-}1)\)-dimensional probability simplex \(\Delta^{K-1}\).
Any probability distribution over the vocabulary is a point inside the simplex. The closer a point is to a vertex, the higher the probability assigned to that token.
The simplex is defined as:
\[\Delta^{K-1} = \{x \in \mathbb{R}^K : x \geq 0,\; \langle \mathbf{1}, x \rangle = 1\}\]
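As a quick sanity check, the membership test implied by this definition takes a few lines of numpy (using the three-token toy vocabulary from above):

```python
import numpy as np

K = 3  # toy vocabulary: fox, dog, cat
vertices = np.eye(K)  # one-hot vectors e_1, ..., e_K as rows

def on_simplex(x, tol=1e-8):
    """Check x >= 0 and <1, x> = 1, i.e. membership in Delta^{K-1}."""
    x = np.asarray(x, dtype=float)
    return bool((x >= -tol).all() and abs(x.sum() - 1.0) < tol)

assert on_simplex(vertices[0])            # a vertex is a valid distribution
assert on_simplex(np.full(K, 1 / K))      # so is the uniform distribution
assert not on_simplex([0.9, 0.2, -0.1])   # negative mass: outside the simplex
```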
To generate discrete data, we want an ODE
\[\dot x_s = b_s(x_s), \qquad x_0 \sim \rho_0,\]
that transports a continuous base distribution \(\rho_0\) in \(\mathbb{R}^K\) to the discrete target \(\rho_1\) supported on the simplex vertices. How do we learn the velocity field \(b_s\)?
We need training pairs that tell us where the flow should go at each time. The stochastic interpolant provides exactly this: given a base sample \(I_0 \sim \rho_0\) and a data token \(I_1 \in \{e_1, \dots, e_K\}\) drawn from \(\rho_1\), it defines a straight-line path between them:
\[I_s = (1 - s)\, I_0 + s\, I_1, \qquad s \in [0, 1].\]
Each base–data pair \((I_0, I_1)\) traces a straight line from the base distribution toward a simplex vertex. The interpolant \(I_s\) slides along this line as \(s\) increases. At \(s{=}0\) we recover the base; at \(s{=}1\) we arrive at a one-hot vertex. At any intermediate time the population of \(I_s\) points forms a cloud that smoothly evolves from continuous to discrete.
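A minimal numpy sketch of the interpolant; the Gaussian base is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3

def interpolant(I0, I1, s):
    """The straight-line interpolant I_s = (1 - s) I_0 + s I_1."""
    return (1 - s) * I0 + s * I1

I0 = rng.standard_normal(K)       # base sample (here Gaussian, an assumption)
I1 = np.eye(K)[rng.integers(K)]   # data token: a uniformly chosen vertex

# endpoints: the base at s = 0, the one-hot vertex at s = 1
assert np.allclose(interpolant(I0, I1, 0.0), I0)
assert np.allclose(interpolant(I0, I1, 1.0), I1)

# the slope is constant along the line: I_1 - I_0
ds = 1e-6
slope = (interpolant(I0, I1, 0.5 + ds) - interpolant(I0, I1, 0.5)) / ds
assert np.allclose(slope, I1 - I0, atol=1e-5)
```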
The interpolant tells us the desired direction at every point. Each interpolant has slope \(\dot{I}_s = I_1 - I_0\). But many interpolants pass through any fixed point \(x_s\). The right velocity field averages over all of them:
\[b_s(x_s) = \mathbb{E}[\dot{I}_s \mid I_s = x_s].\]
The instantaneous denoiser is the conditional expectation of the clean data given the noisy observation:
\[\begin{aligned}\psi_{s,s}(x_s) &= \mathbb{E}[I_1 \mid I_s = x_s] \\ &= \sum_{i=1}^{K} \underbrace{\mathbb{P}(I_1 = e_i \mid I_s = x_s)}_{p_i}\, e_i.\end{aligned}\]
The velocity field \(b_s(x_s)\) points directly toward the instantaneous denoiser \(\psi_{s,s}(x_s)\).
Since the weights \(p_i\) are non-negative and sum to one, \(\psi_{s,s}\) is a convex combination of vertices and therefore always a valid probability distribution on the simplex. This geometric fact means we can learn the instantaneous denoiser with a cross-entropy loss:
\[\psi_{s,s} = \operatorname*{arg\,min}_{\hat \psi_{s,s}}\mathbb{E}\left[ -\sum_{k=1}^K I_1^{(k)} \log \hat{\psi}_{s,s}^{(k)}(I_s) \right].\]
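A sketch of this objective on a batch, with \(\hat\psi_{s,s}\) supplied directly as simplex points rather than produced by a network:

```python
import numpy as np

def diagonal_ce_loss(psi_hat, I1):
    """Cross-entropy -sum_k I_1^(k) log psi_hat^(k), averaged over the batch.

    psi_hat: (B, K) predicted simplex points hat{psi}_{s,s}(I_s)
    I1:      (B, K) one-hot data tokens
    """
    eps = 1e-12  # numerical floor to keep the log finite
    return float(-(I1 * np.log(psi_hat + eps)).sum(axis=1).mean())

# the loss is minimized when psi_hat puts all mass on the observed token
I1 = np.eye(3)[[0, 2]]               # batch of two one-hot targets
good = np.array([[0.9, 0.05, 0.05],
                 [0.1, 0.1, 0.8]])
bad = np.full((2, 3), 1 / 3)         # uniform guess
assert diagonal_ce_loss(good, I1) < diagonal_ce_loss(bad, I1)
```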
The generative ODE
\[\dot x_s = b_s(x_s), \qquad x_0 \sim \rho_0,\]
traces a trajectory from random noise \(x_0\) to a data point \(x_1\) (a simplex vertex). We mark two points along the trajectory: an earlier time \(x_s\) and a later time \(x_t\). The tangent \(b_s(x_s)\) at \(x_s\) projects onto the simplex at the instantaneous denoiser \(\psi_{s,s}\).
As we integrate along the trajectory from \(s\) to \(t\), we define the mean denoiser \(\psi_{s,t}\)—a time-weighted average of the instantaneous denoisers along the path:
\[\psi_{s,t}(x_s) = \int_s^t w(u)\,\underbrace{\mathbb{E}[I_1 \mid I_u = x_u]}_{\psi_{u,u}(x_u)}\, du\]
Since each \(\mathbb{E}[I_1 \mid I_u = x_u]\) is on the simplex and \(w(u)\) is a non-negative weight function that integrates to 1 over \([s,t]\), the mean denoiser is a convex combination of simplex points—so \(\psi_{s,t}(x_s) \in \Delta^{K-1}\) also lives on the simplex.
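In a numerical implementation the integral becomes a quadrature sum; a minimal sketch, with the instantaneous denoisers along the path given as inputs:

```python
import numpy as np

def mean_denoiser(psi_uu, w):
    """Discretized psi_{s,t}: a weighted average of instantaneous denoisers.

    psi_uu: (N, K) instantaneous denoisers psi_{u,u}(x_u) along the path
    w:      (N,) non-negative quadrature weights summing to 1
    """
    return w @ psi_uu

# three simplex points along a path, averaged with uniform weights
psi_uu = np.array([[0.8, 0.1, 0.1],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.6, 0.2]])
w = np.full(3, 1 / 3)
psi_st = mean_denoiser(psi_uu, w)
assert psi_st.min() >= 0 and np.isclose(psi_st.sum(), 1.0)  # still on the simplex
```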
Geometrically, just as the tangent at \(x_s\) projects to the instantaneous denoiser \(\psi_{s,s}\), the secant from \(x_s\) to \(x_t\) projects to the mean denoiser \(\psi_{s,t}\) on the simplex.
The flow map \(X_{s,t}\) jumps directly from time \(s\) to time \(t\) along the ODE trajectory \((x_u)_{u \in [0,1]}\). Since the true flow map is a convex combination of the current state and the mean denoiser:
\[X_{s,t}(x_s) = \tfrac{1-t}{1-s}\,x_s + \tfrac{t-s}{1-s}\,\psi_{s,t}(x_s),\]
we parameterize our map \(\hat X_{s,t}\) in terms of a neural network \(\hat \psi_{s,t} \in \Delta^{K-1}\) in the same way.
At \(t = 1\), the map reduces to the mean denoiser itself: \(X_{s,1}(x_s) = \psi_{s,1}(x_s)\).
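The parameterization and its two boundary cases (identity at \(t = s\), mean denoiser at \(t = 1\)) are easy to check numerically; a sketch:

```python
import numpy as np

def flow_map(x_s, s, t, psi_hat_st):
    """X_{s,t}(x_s) as the stated convex combination of x_s and hat{psi}_{s,t}."""
    return (1 - t) / (1 - s) * x_s + (t - s) / (1 - s) * psi_hat_st

x_s = np.array([0.4, 0.3, -0.1])   # current state at s = 0.2
psi = np.array([0.7, 0.2, 0.1])    # network output on the simplex

# at t = s the map is the identity; at t = 1 it is the mean denoiser
assert np.allclose(flow_map(x_s, 0.2, 0.2, psi), x_s)
assert np.allclose(flow_map(x_s, 0.2, 1.0, psi), psi)
```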
As \(x_t\) evolves along the trajectory, the mean denoiser \(\psi_{s,t}(x_s)\) tracks it like a shadow cast onto the simplex—always inside the triangle.
The flow map need not lie on the simplex, but the mean denoiser is always a valid probability distribution—so we can train it with KL divergence losses, respecting the geometry of the simplex. We next describe four self-consistency conditions that provide training targets for \(\psi_{s,t}\).
The flow map must satisfy the semi-group property: jumping from \(s\) to \(t\) must equal jumping from \(s\) to \(u\), then from \(u\) to \(t\).
This is equivalent to requiring the following convex decomposition of the mean denoiser:
\[\psi_{s,t}(x_s) = \alpha_{s,u,t}\,\psi_{s,u}(x_s) + (1-\alpha_{s,u,t})\,\psi_{u,t}(X_{s,u}(x_s)),\] where \(\alpha_{s,u,t} \in [0,1]\) is a time-dependent weight.
Since both sides of the above equation are probability distributions (the right-hand side is a convex combination of points on the simplex), we can enforce this identity via KL divergence rather than L\(^2\) regression:
\[\mathcal{L}_{\text{PSD}} = \mathbb{E}\Big[D_{\text{KL}}\big(\text{sg}[\alpha\,\hat\psi_{s,u} + (1-\alpha)\,\hat\psi_{u,t}(\hat X_{s,u})]\;\|\;\hat\psi_{s,t}\big)\Big].\]
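A numpy sketch of this loss; in an autodiff framework the target would be wrapped in stop-gradient, while here every array is already a constant:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Row-wise D_KL(p || q) for batches of simplex points."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

def psd_loss(psi_su, psi_ut, psi_st, alpha):
    """L_PSD sketch: KL from the two-jump target to the one-jump denoiser.

    psi_su: hat{psi}_{s,u}(x_s);  psi_ut: hat{psi}_{u,t}(hat{X}_{s,u}(x_s));
    psi_st: hat{psi}_{s,t}(x_s);  alpha is the weight alpha_{s,u,t} in [0, 1].
    """
    target = alpha * psi_su + (1 - alpha) * psi_ut  # convex combo: on the simplex
    return float(kl(target, psi_st).mean())

psi_su = np.array([[0.6, 0.3, 0.1]])
psi_ut = np.array([[0.2, 0.5, 0.3]])
alpha = 0.4
target = alpha * psi_su + (1 - alpha) * psi_ut
assert psd_loss(psi_su, psi_ut, target, alpha) == 0.0  # consistent map: zero loss
assert psd_loss(psi_su, psi_ut, psi_su, alpha) > 0.0   # inconsistent map: penalized
```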
The Lagrangian perspective requires that the flow endpoint velocity \(\partial_t X_{s,t}(x_s)\) matches the instantaneous drift \(b_t(x_t)\). In terms of the mean denoiser, this is equivalent to:
\[\psi_{s,t}(x_s) = \psi_{t,t}(x_t) - C_{s,t}\,\tfrac{\partial}{\partial t}\psi_{s,t}(x_s)\]
This means the tangent to the curve \(\psi_{s,t}(x_s)\) is always parallel to the direction \(\psi_{t,t}(x_t) - \psi_{s,t}(x_s)\).
The right-hand side of the above equation need not be a probability distribution when parametrized by arbitrary neural networks. To obtain a valid target distribution, we derive an equivalent condition in logit space:
\[\psi_{s,t} = \text{Softmax}\!\big(z_{t,t}(x_t) - \log(\mathbf{1} + C_{s,t}(\dot z_{s,t} - \bar{\dot z}_{s,t}\mathbf{1}))\big).\]
This logit-space formulation always yields a valid probability distribution, enabling KL divergence training.
The Eulerian perspective enforces that the flow map is invariant to the source time: \(\partial_s X_{s,t}(x_s) = 0\). In terms of the mean denoiser, this is equivalent to:
\[\frac{\partial}{\partial s}(\psi_{s,t}(x_s)) = \kappa_{s,t}\big(\psi_{s,t}(x_s) - \psi_{s,s}(x_s)\big),\]
where \(\kappa_{s,t}\) is a time-dependent scalar.
Again, we derive an equivalent logit-space condition:
\[\psi_{s,t} = \text{Softmax} \big(z_{s,s}(x) - \log(\mathbf{1} - \kappa_{s,t}^{-1}(D_s z_{s,t} - \overline{D}_s z_{s,t}\,\mathbf{1}))\big),\]
which always yields a valid probability distribution for KL training.
Taking the derivative of the semi-group condition with respect to the intermediate time \(u\) yields a fourth identity, which we call dPSD:
\[\frac{d}{du}\big[\alpha_{s,u,t}\,\psi_{s,u}(x) + (1-\alpha_{s,u,t})\,\psi_{u,t}(X_{s,u}(x))\big] = 0.\]
Setting \(u = s\) recovers ESD; setting \(u = t\) recovers LSD. This unifies the previous differential consistency rules as special cases of one differential identity.
All four consistency losses—PSD, LSD, ESD, and dPSD—are exact objectives derived directly from the geometric definitions of the flow, with no approximations.
Minimizing any of these losses together with the diagonal cross-entropy loss ensures that the learned flow map \(\hat X_{s,t}\) matches the true flow map \(X_{s,t}\) and therefore generates the correct data distribution at \(t=1\).
Unlike autoregressive models that generate one token at a time, discrete flow maps operate on all \(L\) token positions simultaneously.
Each position has its own simplex. The flow map transports noise to data across all positions in parallel, enabling generation of an entire sequence in a single forward pass.
The flows from Gaussian noise converge to simplex vertices—each vertex representing a token in the vocabulary. The full sequence emerges simultaneously rather than sequentially.
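A sketch of one-step parallel generation, with a row-wise softmax standing in for a trained network \(\hat\psi_{0,1}\):

```python
import numpy as np

rng = np.random.default_rng(1)
K, L = 3, 5   # vocabulary size and sequence length

def psi_hat_01(x):
    """Stand-in for a trained network: maps R^{L x K} to L simplex points."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)   # softmax rows: on the simplex

def one_step_generate(x0, denoiser):
    """Jump from s = 0 to t = 1 in one evaluation, X_{0,1}(x_0) = psi_{0,1}(x_0),
    then decode each position by its nearest vertex (argmax)."""
    return denoiser(x0).argmax(axis=-1)

x0 = rng.standard_normal((L, K))     # independent noise per position
tokens = one_step_generate(x0, psi_hat_01)
assert tokens.shape == (L,)          # one token per position, in parallel
```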
A pre-trained discrete flow map may not satisfy downstream objectives out of the box. For instance, a base model might produce an incorrect arithmetic result.
The base (green) dynamics land on tokens that form a plausible but wrong answer.
Fine-tuning steers the flow map dynamics toward a reward signal. The adapted (red) trajectories are nudged so that the landing tokens produce the correct answer.
Because each position evolves on its own simplex, steering preserves the parallel structure: all positions are corrected simultaneously.
We validate that respecting the geometry of the simplex allows us to surpass previous state-of-the-art results in discrete flow modeling. Results coming soon.