What is Stable Diffusion?

Stable Diffusion is the name of a powerful image generation model based on diffusion models. While the term is sometimes used loosely for the platforms or services built around it, technically it refers to the underlying model that produces high-quality images from prompts. The user provides a text prompt, and starting from random noise, the model generates a high-resolution image through iterative denoising.


Key Concepts

1. Inputs and Outputs

  • Input: Text prompt + random noise (Gaussian noise)
  • Output: A clean, high-resolution image

The model gradually converts noise into a coherent image that aligns with the input text prompt.


2. Diffusion Model Overview

  • A diffusion model learns to remove noise from an image.
  • The generation process is a reverse diffusion process: start from noise and iteratively denoise.
  • This process is repeated for a number of steps (typically ~50) until a clear image is produced; a minimal sampling loop is sketched below.
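
As a rough sketch, the reverse process can be written as a simple loop. The `predict_noise` function below stands in for the trained denoising network (it is hypothetical, not a real library call); the update rule follows the standard DDPM sampler:

```python
import numpy as np

def sample(predict_noise, shape, betas):
    """Reverse diffusion: start from pure noise and iteratively denoise.

    predict_noise(x_t, t) -> estimated noise (stands in for the trained model)
    betas: the noise schedule, e.g. np.linspace(1e-4, 0.02, 50)
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = np.random.randn(*shape)                  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)                # model's estimate of the noise in x_t
        # DDPM mean update: remove the predicted noise contribution
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                # re-inject fresh noise except at the end
            x = x + np.sqrt(betas[t]) * np.random.randn(*shape)
    return x                                     # approximately a clean image
```

Each iteration removes a little of the estimated noise and, except at the final step, re-injects a smaller amount of fresh noise, which is what makes the process stochastic.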

Why Use Noise?

Noise is easy to generate using code, while meaningful images are complex and expensive to synthesize from scratch. By learning to reverse the corruption process (i.e., denoise), the model learns to create images from noise—a tractable learning problem.


Training the Diffusion Model

Step-by-Step:

  1. Start from a clean image.
  2. Add noise gradually over multiple steps (forward diffusion process).
  3. At each step, record the noisy image and the noise added.
  4. The model is trained to predict the noise added at each step, so that it can be removed (reverse diffusion).

This training process frames image generation as a noise estimation problem.
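
In code, one training step can be sketched as follows (PyTorch-style; `model` is a hypothetical denoising network, and the one-shot corruption uses the closed form given at the end of the next section):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bars):
    """One training step: corrupt a clean image, then learn to predict the noise.

    model(x_t, t) -> predicted noise (a hypothetical denoising network)
    x0: batch of clean images, shape (B, C, H, W)
    alpha_bars: tensor of cumulative products of (1 - beta_t), shape (T,)
    """
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))        # a random timestep per image
    eps = torch.randn_like(x0)                         # the noise we are about to add
    ab = alpha_bars[t].view(B, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps       # forward diffusion in one shot
    eps_pred = model(x_t, t)                           # the model estimates the noise
    return F.mse_loss(eps_pred, eps)                   # noise-estimation objective
```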


Mathematical Formulation

Let:

  • $x_{0}$: original image
  • $x_{t}$: noisy image at timestep $t$
  • $\beta_{t}$: noise schedule (how much noise is added at timestep $t$)
  • $\epsilon$: sampled Gaussian noise

The forward process is defined as:

$$
x_t = \sqrt{1 - \beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon
$$

The model learns to estimate $\epsilon$ from $x_t$ (together with the timestep $t$); removing the estimated noise step by step recovers the image.
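
For completeness, repeatedly applying this update has a well-known closed form. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (standard DDPM notation, not used elsewhere in this article), any $x_t$ can be sampled directly from the clean image:

$$
x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon
$$

This is what makes training efficient: a noisy sample at any timestep can be produced in a single step, without simulating the whole chain.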


Model Architecture: U-Net

The core model for denoising is often a U-Net, which:

  • Compresses the input (the noisy image) into a low-dimensional feature space
  • Expands it back into a denoised image
  • Injects time step information and optionally text embeddings at each level to guide generation

Adding Text Guidance

To generate images that match user prompts:

  • Text is encoded into vector form (using models like CLIP or T5)
  • The text embedding is injected into the U-Net via cross-attention
  • This allows the model to align the generated image with the semantic content of the prompt; a minimal encoding sketch follows below
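
As an illustration, here is roughly what this step looks like with the Hugging Face transformers and diffusers libraries (the model identifiers and shapes match Stable Diffusion v1.x checkpoints; treat this as a sketch, since the real pipeline wires these pieces together internally):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import UNet2DConditionModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

# Encode the prompt into a sequence of embeddings (1 x 77 x 768 for SD v1.x)
tokens = tokenizer("a cat wearing a hat", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids).last_hidden_state

# The U-Net takes the noisy latent, the timestep, and the text embedding;
# its cross-attention layers attend to text_emb at several resolutions.
latents = torch.randn(1, 4, 64, 64)      # a noisy latent (see the LDM section below)
with torch.no_grad():
    noise_pred = unet(latents, timestep=10,
                      encoder_hidden_states=text_emb).sample
```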

Latent Diffusion Models (LDMs)

To reduce computation:

  • Instead of applying the denoising to full-resolution images, LDMs compress the image into a latent space using an encoder
  • Diffusion operates in this smaller latent space
  • After denoising, a decoder transforms the latent representation back into a full image

This approach, used in Stable Diffusion, dramatically reduces computational load while maintaining quality.
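
A sketch of that round trip with the diffusers VAE (the 0.18215 scaling constant is the one used by the Stable Diffusion v1.x checkpoints):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)   # stand-in for a real RGB image scaled to [-1, 1]

with torch.no_grad():
    # Encoder: a 512x512x3 image becomes a 64x64x4 latent (a 48x compression)
    latents = vae.encode(image).latent_dist.sample() * 0.18215
    # ... the diffusion loop would run here, entirely in latent space ...
    # Decoder: the latent is expanded back to a full-resolution image
    decoded = vae.decode(latents / 0.18215).sample
```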


Controlling the Output

  • Text prompts guide the output image (text-to-image generation)
  • Initial images can also be encoded and used as inputs (image-to-image generation)
  • Fine-grained control is possible by adjusting the latent vectors or combining textual and visual embeddings (see the usage sketch after this list)
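
With the high-level diffusers pipelines, both modes take only a few lines (a sketch; the checkpoints must be downloaded, and a GPU is strongly recommended):

```python
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image: the prompt alone steers generation from pure noise
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
image = pipe("a watercolor painting of a lighthouse",
             num_inference_steps=50).images[0]

# Image-to-image: an initial image is encoded to latent space, partially
# noised, then denoised under the guidance of a new prompt
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4"
).to(device)
edited = img2img(prompt="the same lighthouse at sunset",
                 image=image, strength=0.6).images[0]
```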

Efficient Training and Inference

Stable Diffusion supports:

  • LoRA (Low-Rank Adaptation): a technique for efficient fine-tuning with fewer trainable parameters (a minimal sketch follows this list)
  • Quantization and offloading: making inference possible on GPUs with ~24GB of VRAM
  • Custom fine-tuning: enabling personalized image generation by training on user-specific datasets
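
To see why LoRA is so cheap to train, here is a minimal LoRA-style linear layer (a conceptual sketch, not the implementation used by any particular library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base layer W plus a trainable low-rank update (B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no effect at start
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# A 768x768 linear layer has ~590k weights; with rank 4, LoRA trains only ~6k
layer = LoRALinear(nn.Linear(768, 768), rank=4)
```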

Summary: Key Takeaways

  • Diffusion Models generate images by reversing a noise process.
  • Stable Diffusion enhances this by operating in latent space, using text/image embeddings to guide generation.
  • It supports lightweight inference, high image quality, and flexible customization.
  • The learning task is reframed as estimating noise, making image generation both mathematically elegant and computationally feasible.
