Current denoising-based generation models consist of three main components:
- Forward Process
- Sampling
- Model Prediction
Forward Process
The forward process can be understood as an ODE:
\[dx_{t} = f(x_t, t) \, dt\]
If we add a stochastic perturbation, the ODE is transformed into an SDE,
\[dx_{t} = f(x_t, t) \, dt + g(t) \, d\mathbf{w}\]
Here, \(d\mathbf{w}\) represents the increment of a Wiener process \(\mathbf{w}_t\), where \(\mathbf{w}_t \sim \mathcal{N}(0, t)\). Therefore, \(d\mathbf{w} \sim \mathcal{N}(0, dt)\)
This can be approximated as:
\(d\mathbf{w} = \sqrt{dt} \; \epsilon\) , where \(\epsilon \sim \mathcal{N}(0, I)\)
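As a quick numerical sanity check (a NumPy sketch, not tied to any particular method), summing increments \(\sqrt{dt}\,\epsilon\) reproduces \(\mathbf{w}_T \sim \mathcal{N}(0, T)\):

```python
import numpy as np

# Simulate Wiener paths by summing increments dw = sqrt(dt) * eps, eps ~ N(0, 1).
rng = np.random.default_rng(0)
dt = 2e-3
n_steps = 500                         # total time T = n_steps * dt = 1.0
n_paths = 5000

eps = rng.standard_normal((n_paths, n_steps))
w_T = (np.sqrt(dt) * eps).sum(axis=1)  # each path's endpoint, w_T ~ N(0, T)

print(w_T.mean(), w_T.var())           # close to 0 and T = 1.0
```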
Different denoising methods define their own specific functions for \(f(x_t, t)\) and \(g(t)\)
Sampling
By adding noise during the forward process, we train the model to reconstruct the distribution of the training data. During inference, this allows us to sample an image or video from the learned distribution. The sampling can be either deterministic or stochastic. A common update equation is:
\[x_{t + \Delta t} \sim \mathcal{N}(x_t + f(x_t, t) \Delta t, g^2(t) \Delta t)\]
The reverse process aims to compute the posterior distribution \(p(x_t \mid x_{t + \Delta t})\)
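The update above can be sketched as a single Euler–Maruyama step; `f` and `g` below are placeholders standing in for whichever drift and diffusion a concrete method defines:

```python
import numpy as np

def em_step(x_t, t, dt, f, g, rng):
    """Sample x_{t+dt} ~ N(x_t + f(x_t, t) * dt, g(t)^2 * dt)."""
    z = rng.standard_normal(x_t.shape)
    return x_t + f(x_t, t) * dt + g(t) * np.sqrt(dt) * z

# Example with toy drift/diffusion (illustrative choices, not from the text).
rng = np.random.default_rng(0)
x = np.zeros(4)
x = em_step(x, t=0.0, dt=0.01, f=lambda x, t: -x, g=lambda t: 1.0, rng=rng)
```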
Model Prediction
The remaining design choice is the prediction target: the model can learn the score function, the velocity, the original data, or the noise.
DDPM
1.1 Forward Process
The forward process is given by:
\[x_t = \sqrt{1-\beta_t} x_{t-1} + \sqrt{\beta_t} \varepsilon\]
This can also be written as:
\[x_t = \sqrt{\bar{\alpha_t}} x_{data} + \sqrt{1 - \bar{\alpha_t}} \varepsilon\]
We can express this in the form of an SDE \(^{\text{ref-Section D}}\) :
\[dx_t = f(x_t, t) dt + g(t) d\mathbf{w} = -\frac{1}{2} \beta(t) x_t \;dt + \sqrt{\beta(t)} d\mathbf{w}\]
We can clearly see from this equation that the DDPM forward process traces a curved trajectory, where both the magnitude and direction of the velocity are time-dependent.
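The closed-form forward step can be sketched as follows; the linear \(\beta\) schedule (1e-4 to 0.02 over 1000 steps) is one common choice, assumed here for illustration:

```python
import numpy as np

# Closed-form DDPM forward: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alphas_bar = np.cumprod(1.0 - betas)   # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_near_data = q_sample(x0, 0, eps)      # t small: almost no noise
x_near_noise = q_sample(x0, T - 1, eps) # t large: almost pure noise
```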
1.2 Sampling
The reverse process is expressed as:
\[dx_t = \left(f(x_t, t) - g^2(t) \nabla_x \log p(x_t)\right) dt + g(t) d\bar{\mathbf{w}}\]
\(\bar{\mathbf{w}}\) is a reverse-time Wiener process.
This process can be solved using any SDE solver you like.
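A single reverse-time Euler–Maruyama step for this SDE can be sketched as below; the score function is a placeholder (in practice a trained network supplies it), and the linear \(\beta(t)\) is an assumed example schedule:

```python
import numpy as np

def reverse_step(x, t, dt, beta, score, rng):
    """One Euler-Maruyama step of the reverse SDE, from t to t - dt."""
    f = -0.5 * beta(t) * x               # DDPM drift f(x_t, t)
    g = np.sqrt(beta(t))                 # DDPM diffusion g(t)
    drift = f - g**2 * score(x, t)       # reverse-time drift
    z = rng.standard_normal(x.shape)
    return x - drift * dt + g * np.sqrt(dt) * z

rng = np.random.default_rng(0)
beta = lambda t: 0.1 + 19.9 * t          # assumed beta(t), for illustration
score = lambda x, t: -x                  # placeholder: score of a standard normal
x = rng.standard_normal(4)
x = reverse_step(x, t=1.0, dt=1e-3, beta=beta, score=score, rng=rng)
```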
1.3 Model Prediction
\[v := \sqrt{\bar{\alpha_t}} \epsilon - \sqrt{1-\bar{\alpha_t}} x_{data}\]
\[\epsilon_{pred} = \sqrt{\bar{\alpha_t}} v_{pred} + \sqrt{1-\bar{\alpha_t}} x_t\]
\[\bar{\alpha_t} \text{MSE}(v_{pred}, v) = \text{MSE}(\varepsilon_{pred}, \varepsilon)\]
- Convert to Predict \(x_{data}\)
\[\hat{x_0} = \sqrt{\bar{\alpha_t}} x_t - \sqrt{1-\bar{\alpha_t}} v_{pred}\]
\[(1-\bar{\alpha_t})\, \text{MSE}(v_{pred}, v) = \text{MSE}(\hat{x_0}, x_{data})\]
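These conversions can be verified numerically; the sketch below checks that the identities recover \(\varepsilon\) and \(x_{data}\) exactly (the value of \(\bar{\alpha_t}\) is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
abar = 0.7                                # arbitrary example value of abar_t

a, s = np.sqrt(abar), np.sqrt(1.0 - abar)
x_t = a * x0 + s * eps                    # forward sample
v = a * eps - s * x0                      # v-prediction target

eps_rec = a * v + s * x_t                 # should recover eps
x0_rec = a * x_t - s * v                  # should recover x0
```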
Score Matching
2.1 Forward Process
The forward process in score matching is:
\[x_t = x_{data} + \sigma_t \varepsilon\]
Here, \(x_{data} \sim p_{data}(x)\) , where \(p_{data}(x)\) represents the distribution of the training dataset. The noise variance decreases over time:
\[\sigma_1 > \sigma_2 > \sigma_3 > \dots\]
This means the noise variance added to the data gradually decreases. The corresponding SDE is:
\[dx_t = \sqrt{\frac{d\sigma^2_t}{dt}} \, d\mathbf{w}\]
The forward process can be imagined as a straight line going from data to noise, where the velocity (the variance of the added noise) gradually decreases.
2.2 Sampling
\[d\mathbf{x_t} = -\left(\frac{d[\sigma(t)^2]}{dt} \nabla_{\mathbf{x}} \log p(\mathbf{x_t}) \right) dt + \sqrt{\frac{d[\sigma(t)^2]}{dt}} d\bar{\mathbf{w}}\]
The reverse sampling follows the Langevin equation:
\[x_{t+1} = x_t + \tau \nabla_x \log p(x_t) + \sqrt{2\tau} z\]
where \(z \sim \mathcal{N}(0, I)\)
We can see from the sampling equation that although the forward process is linear, the reverse process is stochastic, so the sampling trajectory cannot be reproduced deterministically.
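The Langevin update can be sketched with the exact score of a standard normal, \(\nabla_x \log p(x) = -x\) (an illustrative choice so convergence to the target variance is checkable):

```python
import numpy as np

rng = np.random.default_rng(0)
score = lambda x: -x                     # exact score of N(0, 1)
tau = 0.01                               # step size

x = rng.standard_normal(5000) * 3.0      # start far from the target distribution
for _ in range(2000):
    z = rng.standard_normal(x.shape)
    x = x + tau * score(x) + np.sqrt(2.0 * tau) * z

print(x.var())  # approaches the target variance, 1.0
```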
2.3 Model Prediction
\[J_{\text{NCSM}}(\theta) = \sum_{i=1}^{L} \lambda_i \;\mathbb{E}_{p(\mathbf{x}),\, \varepsilon} \left[ \frac{1}{2} \left\| s_\theta(x_{data} + \sigma_i \varepsilon, \sigma_i) + \frac{\varepsilon}{\sigma_i} \right\|^2 \right]\]
Note this is the loss function when we use score matching forward process to add noise progressively.
If we use DDPM noise scheduler, which means \(x_t = \sqrt{\bar{\alpha_t}} x_{data} + \sqrt{1 - \bar{\alpha_t}} \varepsilon\) holds true,
Then, the score can be expressed in terms of \(\varepsilon\):
\[\nabla_x \log p(x_t \mid x_{data}) = -\frac{x_t-\sqrt{ \bar{ \alpha_t} } x_{data}}{1-\bar{\alpha_t}} = -\frac{\varepsilon}{\sqrt{1-\bar{\alpha_t}}}\]
It follows that \(x_t = \sqrt{\bar{\alpha_t}} x_{data} - (1 - \bar{\alpha_t}) s_{\theta}\), where \(s_{\theta}\) is the score predicted by the neural network.
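This relation can be checked numerically: substituting the conditional score \(-\varepsilon/\sqrt{1-\bar{\alpha_t}}\) reproduces the forward sample (the value of \(\bar{\alpha_t}\) is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
abar = 0.5                                    # arbitrary example value

x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps   # DDPM forward sample
s = -eps / np.sqrt(1 - abar)                         # conditional score
x_t_rec = np.sqrt(abar) * x0 - (1 - abar) * s        # should equal x_t
```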
Flow Matching
3.1 Forward Process
In flow matching, the forward process is:
\(x_t = (1 - t) x_{data} + t \varepsilon = a_t x_{data} + b_t \varepsilon\), where \(t \in [0, 1]\)
Flow matching can be regarded as a uniform linear motion between data and noise.
3.2 Sampling
The reverse ODE is:
\[\frac{dx_t}{dt} = \varepsilon - x_{data} = v_t(x)\]
You can solve this ODE using Euler’s method. An interesting fact is that the direction of the velocity in Rectified Flow is from noise to data; whereas in DDPM-v-pred, it is from data to noise.
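A minimal Euler integration of this ODE for a single \((x_{data}, \varepsilon)\) pair: the conditional velocity is constant, so stepping from \(t=1\) (pure noise) back to \(t=0\) recovers the data exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.standard_normal(4)
eps = rng.standard_normal(4)
v = eps - x_data                       # dx_t/dt along the straight path

n_steps = 10
dt = 1.0 / n_steps
x = eps.copy()                         # start at t = 1 (pure noise)
for _ in range(n_steps):
    x = x - v * dt                     # Euler step backwards in t
# x now equals x_data (exactly, since the velocity is constant)
```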
3.3 Model Prediction
The objective function for flow matching \(^{\text{ref-Section 2}}\); \(^{\text{ref-Theorem 3}}\) is:
\[\text{MSE}\big(v_\theta(x_t, t),\; u(x_t \mid \varepsilon)\big)\]
where \(u(x_t \mid \varepsilon) = \frac{a'_t}{a_t} x_t - \frac{b_t}{2} \left(\log \frac{a^2_t}{b^2_t}\right)' \varepsilon\)
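As a sanity check, for the linear path \(a_t = 1 - t\), \(b_t = t\), the target \(u(x_t \mid \varepsilon)\) should reduce to the true velocity \(\varepsilon - x_{data}\):

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.standard_normal(4)
eps = rng.standard_normal(4)
t = 0.3                                          # arbitrary interior time

a, b = 1.0 - t, t                                # a_t, b_t for the linear path
da, db = -1.0, 1.0                               # a'_t, b'_t
x_t = a * x_data + b * eps

# (log(a^2 / b^2))' = 2 a'/a - 2 b'/b
dlog = 2.0 * da / a - 2.0 * db / b
u = (da / a) * x_t - (b / 2.0) * dlog * eps      # target velocity
# u equals eps - x_data, the constant velocity of the straight path
```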