A comprehensive introduction to flow matching for generative modeling. We’ll explore the mathematical foundations, derive key results, and implement practical examples with PyTorch.
Recommended background: Undergraduate-level calculus and probability theory
Introduction
“Creating noise from data is easy; creating data from noise is generative modeling.” (Song et al., 2021)
This quote beautifully captures the essence of modern generative models, among them the image & video generation models that we all play with.
In more technical terms, we are given:
- A dataset of samples \(x_1, x_2, \ldots, x_n \in \mathbb{R}^d\) drawn from an unknown data distribution \(q(x)\)
- A simple prior distribution \(p_0(x)\) (often Gaussian noise)
and we want to learn a mapping \(T : \mathbb{R}^d \to \mathbb{R}^d\) that generates new data points from the prior distribution, so that the generated samples \(T(x_0)\) with \(x_0 \sim p_0\) closely follow the true data distribution \(q(x)\). There are many approaches to learning this mapping \(T\), but in this post, we will focus on flow matching, a recent generative modeling framework that is both simple and scalable. Flow matching has powered state-of-the-art generative models in a wide range of modalities, including images and videos (Esser et al., 2024; Polyak et al., 2025), speech (Le et al., 2023; A. H. Liu et al., 2024), audio & music (Prajwal et al., 2024; Vyas et al., 2023), and protein structures (Bose et al., 2024; Jing et al., 2024).
In the past two years or so, hundreds of papers have proposed improvements to flow matching, but I will focus only on the original “basic” version of it. As I don’t feel very confident in my math, I will try to avoid complex proofs, so please read the linked resources for more details.
Continuous Normalizing Flows
Consider a time-dependent vector field (velocity field) \(v : [0, 1] \times \mathbb{R}^d \to \mathbb{R}^d\) that smoothly evolves samples drawn from a source distribution \(p_0(x)\). This velocity field induces a time-dependent mapping, called a flow \(\phi: [0, 1] \times \mathbb{R}^d \to \mathbb{R}^d\), defined as the solution to the following ordinary differential equation (ODE): \[
\frac{d}{dt} \phi_t(x) = v_t(\phi_t(x)), \qquad \phi_0(x) = x
\tag{1}\] Starting from a sample \(x_0 \sim p_0\), the flow transports it along the velocity field via \(x_t = \phi_t(x_0)\), and thereby induces a time-dependent probability path \(p_t\) (the distribution of \(x_t\), i.e., the push-forward of \(p_0\) under the flow): \[
p_t = [\phi_t]_* \, p_0
\tag{2}\]
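As a quick toy illustration (not from the original post), take the contracting velocity field \(v_t(x) = -x\), whose flow has the closed form \(\phi_t(x) = e^{-t} x\); a simple Euler discretization in PyTorch recovers it:

```python
import torch

def v(t, x):
    # Toy velocity field v_t(x) = -x; its exact flow is phi_t(x) = exp(-t) * x.
    return -x

x = torch.tensor([2.0, -1.0, 0.5])      # phi_0(x) = x
K = 1000                                 # number of Euler steps
for k in range(K):
    x = x + (1.0 / K) * v(k / K, x)      # Euler step of the flow ODE
print(x)  # close to exp(-1) * [2.0, -1.0, 0.5] ~= [0.736, -0.368, 0.184]
```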
The velocity field \(v_t\) and the induced probability path \(p_t\) are linked to each other by the continuity equation:
\[
\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_t) = 0
\tag{3}\] where \(p_t v_t\) denotes the probability flux and \(\nabla \cdot\) is the divergence operator (defined as \(\nabla \cdot F = \sum_{i=1}^d \frac{\partial F_i}{\partial x_i}\) for a given vector field \(F : \mathbb{R}^d \to \mathbb{R}^d\)).
ELI5: Continuity Equation
The continuity equation has its roots in physics, with a notable application in fluid dynamics, where it describes the conservation of mass of a fluid flowing with a given velocity. In simple terms, the equation expresses the conservation of a quantity: the density of a fluid at a specific location changes only if fluid flows into or out of that location. Similarly, probability is a quantity that is neither created nor destroyed (it always sums to 1); it just moves around, guided by the velocity field \(v_t\).
Chen et al. (2018) proposed to model such a velocity field \(v_\theta\) with a neural network, where \(\theta \in \mathbb{R}^p\) are the parameters of the network, and named the resulting flow models continuous normalizing flows (CNF).
For ML folks
CNFs can be seen as an extension of traditional normalizing flows, moving from a sequence of discrete transformations to a single continuous transformation, and relying on the instantaneous change of variables formula that follows from the continuity equation.
The goal of CNFs is to learn a velocity field \(v_\theta(t, x_t)\) such that the induced probability path \(p_t(x)\) ends up matching the true data distribution \(q(x)\) at time \(t=1\). CNFs achieve this by training the model to maximize the likelihood of the data: \[
\mathcal{L}(\theta) = \mathbb{E}_{x \sim q} \left[ \log p_1(x) \right]
\tag{4}\] where we can derive the log-likelihood as: \[
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 (\nabla \cdot v_\theta)(x_t) dt
\tag{5}\]
Proof
From the continuity equation, we have: \[
\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_t) = 0 \tag{1}
\] We can expand the divergence term using the product rule: \[
\nabla \cdot (p_t v_t) = (\nabla p_t) \cdot v_t + p_t (\nabla \cdot v_t)
\] Substituting this back into the continuity equation, we get: \[
\frac{\partial p_t}{\partial t} = - (\nabla p_t) \cdot v_t - p_t (\nabla \cdot v_t)
\] Using \(\frac{\partial \log f}{\partial t} = \frac{1}{f} \frac{\partial f}{\partial t}\), \(\nabla(\log f) = \frac{1}{f} \nabla f\), and dividing both sides by \(p_t\), we get: \[
\frac{\partial \log p_t}{\partial t} = - (\nabla \log p_t) \cdot v_t - (\nabla \cdot v_t) \tag{2}
\] Now consider the change in \(\log p_t(x_t)\) along a trajectory \(x_t\). We can calculate the total derivative using the chain rule: \[
\frac{d}{dt} \log p_t(x_t) = \frac{\partial \log p_t(x_t)}{\partial t} + \nabla \log p_t(x_t) \cdot \frac{d x_t}{dt}
\] Substituting (2) and the fact that \(\frac{d x_t}{dt} = v_t(x_t)\), we get: \[
\frac{d}{dt} \log p_t(x_t) = - (\nabla \log p_t(x_t)) \cdot v_t(x_t) - (\nabla \cdot v_t(x_t)) + (\nabla \log p_t(x_t)) \cdot v_t(x_t)
\]\[
\frac{d}{dt} \log p_t(x_t) = - (\nabla \cdot v_t(x_t)) \tag{3}
\]
Finally, integrating both sides from \(t=0\) to \(t=1\), we arrive at: \[
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 (\nabla \cdot v_t)(x_t) dt \tag{4}
\]
This is cool and all, but that pesky integral really limits the scalability of CNFs. Training a CNF with this objective requires simulating the ODE, and evaluating the divergence of \(v_\theta\) along the trajectory, for every likelihood computation, which becomes prohibitive for high-dimensional, complex datasets. This is where flow matching comes into play, as it allows us to learn such a velocity field in a simulation-free manner.
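To make the divergence term concrete, here is a minimal sketch (the helper name `divergence` and the brute-force loop are my own, not a library API) of computing \(\nabla \cdot v\) with PyTorch autograd; it costs one backward pass per input dimension, and a CNF has to integrate this quantity along a full ODE trajectory for every likelihood evaluation.

```python
import torch

def divergence(v, t, x):
    # Exact divergence (sum_i d v_i / d x_i) of a velocity field via autograd.
    # One backward pass per dimension -- this is what makes exact CNF
    # likelihoods expensive in high dimensions.
    x = x.detach().requires_grad_(True)
    out = v(t, x)                                    # shape (batch, d)
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(out[:, i].sum(), x, retain_graph=True)[0]
        div = div + grad_i[:, i]
    return div

# Sanity check: for v_t(x) = -x in d dimensions, the divergence is -d everywhere.
print(divergence(lambda t, x: -x, 0.0, torch.randn(4, 3)))  # tensor([-3., -3., -3., -3.])
```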
Flow Matching
The idea of flow matching (Albergo & Vanden-Eijnden, 2023; Lipman et al., 2023; X. Liu et al., 2023) is really simple: instead of maximizing the likelihood of the data, we try to match the ground-truth velocity field. Thus, the training objective turns into a regression problem: \[
\mathcal{L_{\text{FM}}}(\theta) = \mathbb{E}_{t \sim U[0, 1], x \sim p_t} \left[ \left\| v_\theta(t, x) - v(t, x) \right\|^2 \right]
\tag{6}\]
Due to the relationship between the velocity field and the probability path, we can see that minimizing the flow matching loss to zero leads to a perfect velocity field that at time \(t=1\) induces a distribution \(p_1(x)\) that matches the true data distribution \(q(x)\). Getting rid of the simulation during training and arriving at a simple regression problem is honestly amazing, though not very useful yet. If we already knew the ground truth velocity field \(v(t, x)\) and its corresponding probability path \(p_t(x)\), we would not have to learn anything in the first place.
Furthermore, it is easy to see that for a given pair of source and target distributions \((p_0, p_1)\), there are infinitely many probability paths \(p_t\) (and thus infinitely many velocity fields \(v_t\)) that can interpolate between the two. So, the question arises: how do we design the velocity field/probability path we want to match?
I am using the concepts of velocity field and probability path interchangeably here, by which I mean they are linked together through the continuity equation (Equation 3).
Conditional Flow Matching
This is where the key idea of conditional flow matching comes into play. We start by choosing a conditioning variable \(z\) (independent of \(t\)) and express the probability path as a mixture of conditional distributions:
\[
p_t(x) = \mathbb{E}_{z \sim p_{\text{cond}}} \left[ p_t(x|z) \right] = \int p_t(x|z) p_{\text{cond}}(z) dz
\tag{7}\] where the conditional probability path \(p_t(x|z)\) should be chosen so that the marginal path \(p_t(x)\) satisfies the boundary conditions at \(t=0\) and \(t=1\), i.e., \(p_0\) matches the source noise distribution and \(p_1\) matches the target data distribution. For a given conditional probability path \(p_t(x|z)\) and its corresponding velocity field \(v_t(x|z)\), we define the conditional flow matching loss as: \[
\mathcal{L_{\text{CFM}}}(\theta) = \mathbb{E}_{t \sim U[0, 1],\, z \sim p_{\text{cond}},\, x \sim p_t(\cdot|z)} \left[ \left\| v_\theta(t, x) - v_t(x|z) \right\|^2 \right]
\tag{8}\]
I present here without proof the following result:
Important
Key Result: Regressing against the ground-truth marginal velocity field \(v(t, x)\) is equivalent to regressing against the conditional velocity field \(v_t(x|z)\), i.e., \(\nabla_\theta \mathcal{L_{\text{CFM}}}(\theta) = \nabla_\theta \mathcal{L_{\text{FM}}}(\theta)\).
This means that by optimizing the conditional flow matching objective (Equation 8), we arrive at the same solution as with the flow matching objective (Equation 6). Thus, we are able to learn the complex marginal velocity field \(v(t, x)\) using only the simple conditional probability path \(p_t(x|z)\) and velocity field \(v_t(x|z)\).
We now turn our focus to designing these simple objects \(p_{\text{cond}}(z)\), \(p_t(x|z)\), and \(v_t(x|z)\). We explore one variant of many possible choices, namely straight paths from source to target samples.
Let the conditioning variable \(z = (x_0, x_1) \sim p_0 \times q\) be an independently drawn pair of a source and a target sample, i.e., \(p_{\text{cond}}(z) = p_0(x_0)\, q(x_1)\).
We consider Gaussian conditional probability paths that interpolate in a straight line between the source and target samples: \[
p_t(x|z:=(x_0, x_1)) = \mathcal{N}(x; tx_1 + (1-t)x_0, \sigma^2 I)
\] In order to fulfill the boundary conditions, we set \(\sigma = 0\), so the Gaussian distribution collapses to a Dirac delta distribution. \[
p_t(x|z:=(x_0, x_1)) = \delta_{tx_1 + (1-t)x_0}(x)
\tag{9}\]
The conditional velocity field (shown without proof) that generates the above probability path is quite simply the difference between the target and source samples. This makes a lot of sense, as we are just moving in a straight line from the source to the target sample. \[
v_t(x|z:=(x_0, x_1)) = x_1 - x_0
\tag{10}\]
We now have all the ingredients to train our desired CNF in a simple, scalable, simulation-free manner (a PyTorch sketch of both algorithms follows below).
Training Algorithm:
Sample \(x_0 \sim p_0\), \(x_1 \sim q\), and \(t \sim U[0, 1]\)
Compute the interpolated point \(x_t = t x_1 + (1-t) x_0\)
Calculate the loss \(\mathcal{L_{\text{CFM}}}(\theta) = \left\| v_\theta(t, x_t) - (x_1 - x_0) \right\|^2\)
Update \(\theta\) using gradient descent
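To make this concrete, here is a minimal PyTorch sketch of the training loop. The two-layer MLP, batch size, learning rate, and the `sample_target` data sampler are illustrative assumptions (not the post's actual implementation); the network takes the concatenation \([t, x_t]\) as input.

```python
import torch
import torch.nn as nn

def sample_target(n):
    # Stand-in for the data distribution q (here: an assumed 2D four-Gaussian mixture).
    means = torch.tensor([[-3., -3.], [-3., 3.], [3., -3.], [3., 3.]])
    return means[torch.randint(0, 4, (n,))] + 0.3 * torch.randn(n, 2)

# Illustrative velocity network v_theta(t, x); input is the concatenation [t, x] for 2D data.
model = nn.Sequential(
    nn.Linear(3, 64), nn.SiLU(),
    nn.Linear(64, 64), nn.SiLU(),
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(10_000):
    x0 = torch.randn(256, 2)                    # x_0 ~ p_0 (standard Gaussian)
    x1 = sample_target(256)                     # x_1 ~ q
    t = torch.rand(256, 1)                      # t ~ U[0, 1]
    xt = t * x1 + (1 - t) * x0                  # point on the straight conditional path (Eq. 9)
    v_target = x1 - x0                          # conditional velocity (Eq. 10)
    v_pred = model(torch.cat([t, xt], dim=-1))  # v_theta(t, x_t)
    loss = ((v_pred - v_target) ** 2).sum(-1).mean()  # conditional flow matching loss (Eq. 8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```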
Sampling Algorithm:
Sample \(x_0 \sim p_0\)
Integrate with the learned velocity field, e.g., using the Euler method for a desired number of steps \(K\) (see the sketch below): \[
x_{t + \frac{1}{K}} = x_t + \frac{1}{K} v_\theta(t, x_t)
\]
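A matching sketch of the Euler sampler, reusing the model and input convention from the training sketch above (the function name `generate` and the defaults are my own choices):

```python
@torch.no_grad()
def generate(model, n_samples=500, K=100):
    # Euler integration of the learned velocity field from t = 0 to t = 1.
    x = torch.randn(n_samples, 2)                         # x_0 ~ p_0
    for k in range(K):
        t = torch.full((n_samples, 1), k / K)             # current time for every sample
        x = x + (1.0 / K) * model(torch.cat([t, x], dim=-1))
    return x

samples = generate(model)  # model from the training sketch above
```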
Demo
I show here a simple example of generating data from a target distribution composed of four 2D Gaussians, starting from a single Gaussian source. You can find the full code here. Figure 1 shows the conditional flow matching setup, where 500 samples are drawn from the source and target distributions. These samples are randomly paired, and the straight-line paths between them are visualized. This is exactly the training signal that we will use to learn the velocity field. The idea of CFM is that learning to flow in a straight line between two random points leads to an aggregated velocity field that is good enough to generate new samples.
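For reference, here is a rough sketch of how such a training-signal figure can be produced, reusing `sample_target` from the training sketch; the mixture means, the number of path points, and the matplotlib styling are all assumptions and will not exactly reproduce Figure 1.

```python
import torch
import matplotlib.pyplot as plt

n = 500
x0 = torch.randn(n, 2)                 # source samples
x1 = sample_target(n)                  # target samples (sampler from the training sketch)
x1 = x1[torch.randperm(n)]             # random pairing = independent coupling p_0 x q

# Straight-line conditional paths x_t = t*x1 + (1-t)*x0 between each random pair.
ts = torch.linspace(0, 1, 20).view(-1, 1, 1)
paths = ts * x1 + (1 - ts) * x0        # shape (20, n, 2)

plt.plot(paths[..., 0], paths[..., 1], color="gray", alpha=0.2, linewidth=0.5)
plt.scatter(x0[:, 0], x0[:, 1], s=5, label="source")
plt.scatter(x1[:, 0], x1[:, 1], s=5, label="target")
plt.legend()
plt.show()
```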
Figure 4: Generated Trajectories from the Trained Flow Matching Model
From Figure 3 and Figure 4, we can see that the generated points closely follow the target distribution. We can also observe that the learned marginal velocity field does not always produce straight paths. This is a reasonable limitation of the original flow matching framework, as the learning signal comes only from random pairs of source and target samples. These random pairs produce crossing paths as in Figure 1, and you can think of the learned velocity field as the average direction of the paths crossing at a specific location. Several works (X. Liu et al., 2023; Pooladian et al., 2023; Tong et al., 2024) improve upon the original flow matching framework by learning straighter velocity fields, resulting in higher generation quality in fewer sampling steps.
Conclusion
Flow matching offers an elegant and intuitive approach to training generative models without needing to simulate expensive ODEs. By turning the problem into a simple conditional regression task, we are able to scale flow models to high-dimensional complex datasets of different modalities. In a subsequent post, I want to write about discrete flow matching, which transfers the idea of flow matching to the discrete domain and forms the backbone of the recently hot diffusion language models.
References
Albergo, M., & Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. International Conference on Learning Representations. https://openreview.net/forum?id=li7qeBbCR1t
Bose, J., Akhound-Sadegh, T., Huguet, G., FATRAS, K., Rector-Brooks, J., Liu, C.-H., Nica, A. C., Korablyov, M., Bronstein, M. M., & Tong, A. (2024). SE (3)-stochastic flow matching for protein backbone generation. International Conference on Learning Representations. https://openreview.net/forum?id=kJFIH23hXb
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis. International Conference on Machine Learning. https://openreview.net/forum?id=FPnUhsQJ5B
Jing, B., Berger, B., & Jaakkola, T. (2024). AlphaFold meets flow matching for generating protein ensembles. International Conference on Machine Learning. https://openreview.net/forum?id=rs8Sh2UASt
Le, M., Vyas, A., Shi, B., Karrer, B., Sari, L., Moritz, R., Williamson, M., Manohar, V., Adi, Y., Mahadeokar, J., & Hsu, W.-N. (2023). Voicebox: Text-guided multilingual universal speech generation at scale. Advances in Neural Information Processing Systems. https://papers.nips.cc/paper/2023/hash/2d8911db9ecedf866015091b28946e15-Abstract.html
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow matching for generative modeling. International Conference on Learning Representations. https://openreview.net/forum?id=PqvMRDCJT9t
Liu, A. H., Le, M., Vyas, A., Shi, B., Tjandra, A., & Hsu, W.-N. (2024). Generative pre-training for speech with flow matching. International Conference on Learning Representations. https://openreview.net/forum?id=KpoQSgxbKH
Liu, X., Gong, C., & Liu, Q. (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. International Conference on Learning Representations. https://openreview.net/forum?id=XVjTT1nw5z
Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y., Chuang, C.-Y., Yan, D., Choudhary, D., Wang, D., Sethi, G., Pang, G., Ma, H., Misra, I., Hou, J., Wang, J., & Du, Y. (2025). Movie gen: A cast of media foundation models. https://arxiv.org/abs/2410.13720
Pooladian, A.-A., Ben-Hamu, H., Domingo-Enrich, C., Amos, B., Lipman, Y., & Chen, R. T. (2023). Multisample flow matching: Straightening flows with minibatch couplings. International Conference on Learning Representations. https://openreview.net/forum?id=mxkGDxWOHS
Prajwal, K., Shi, B., Le, M., Vyas, A., Tjandra, A., Luthra, M., Guo, B., Wang, H., Afouras, T., Kant, D., et al. (2024). MusicFlow: Cascaded flow matching for text guided music generation. International Conference on Machine Learning. https://openreview.net/forum?id=kOczKjmYum
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations. https://openreview.net/forum?id=PxTIG12RRHS
Tong, A., FATRAS, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., & Bengio, Y. (2024). Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research. https://openreview.net/forum?id=CD9Snc73AW
Vyas, A., Shi, B., Le, M., Tjandra, A., Wu, Y.-C., Guo, B., Zhang, J., Zhang, X., Adkins, R., Ngan, W., et al. (2023). Audiobox: Unified audio generation with natural language prompts. https://arxiv.org/abs/2312.15821