
About Light LoRA

Posted at 2023-07-27

0. About Light LoRA

The HyperDreamBooth paper describes Lightweight DreamBooth, an idea for cutting the number of parameters even further than LoRA does.

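Conventional LoRA takes a weight matrix, for example $W_0=(768,320)$, and models the updated matrix as $W=W_0 + \Delta W$, where $\Delta W$ is a two-layer difference model built from a $(768,4)$ matrix and a $(4,320)$ matrix.
The output dimension is one of $320,640,1280$, depending on the resolution level of the UNet.
The input dimension is $768$, the TextEncoder dimension, for the $K,V$ projections of Cross-Attention. For the $Q,Out$ projections of Cross-Attention, and for Self-Attention, it equals the output dimension.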
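
As a worked restatement of the shapes above (the symbols $A$ and $B$ are labels introduced here for illustration, not names taken from the paper), the low-rank update and its parameter count for the $(768,320)$ example can be written as:

```latex
W = W_0 + \Delta W, \qquad \Delta W = A B, \qquad
A \in \mathbb{R}^{768 \times 4},\; B \in \mathbb{R}^{4 \times 320}

\underbrace{768 \cdot 4 + 4 \cdot 320}_{\text{LoRA}} = 4352
\quad \text{vs.} \quad
\underbrace{768 \cdot 320}_{\text{full } \Delta W} = 245760
```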

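Lightweight DreamBooth (Light LoRA) instead uses a four-layer model, two layers of which are frozen. For example, for $W_0=(768,320)$, $\Delta W$ is computed from four matrices, $(768,100)$, $(100,4)$, $(4,50)$, and $(50,320)$, of which $(768,100)$ and $(50,320)$ are frozen at their initial values. The trainable parameters are only $(100,4)$ and $(4,50)$ regardless of the input and output dimensions, which is less than 1/10 of the parameters of a LoRA model.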
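
To make the counting concrete, here is a minimal sketch that compares trainable parameters per layer. The function names are hypothetical, and the aux widths 100 and 50 are the ones used in this article.

```python
def lora_params(in_dim, out_dim, rank=4):
    """Trainable parameters of a conventional LoRA pair: (in, rank) and (rank, out)."""
    return in_dim * rank + rank * out_dim


def light_lora_params(rank=4, aux_down=100, aux_up=50):
    """Trainable parameters of Light LoRA: only (aux_down, rank) and (rank, aux_up)."""
    return aux_down * rank + rank * aux_up


if __name__ == "__main__":
    # (input_dim, output_dim) pairs that appear in the text
    for in_dim, out_dim in [(768, 320), (320, 320), (640, 640), (1280, 1280)]:
        full = lora_params(in_dim, out_dim)
        light = light_lora_params()
        print(f"{in_dim}->{out_dim}: LoRA {full}, Light LoRA {light}, ratio {light / full:.3f}")
```

The Light LoRA count stays at 600 for every layer, so the saving per layer grows with the layer width: it is modest for the 320-wide layers and largest for the 1280-wide ones.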

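There is an existing example implementation of Light LoRA, but it appears to reassign the down_aux and up_aux weights on every training step.
It also concatenates the down and up weights, probably because a Hypernetwork predicts them from an input image.
Using that implementation as a reference, I try an implementation that simply trains a Light LoRA on its own.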

1. Initialize only up_lora with a zero matrix

Consider four matrices as shown below, and keep the down_aux and up_aux weights fixed at their orthogonal_ initialization.
I modified LoRALinearLayer in C:\ProgramData\Anaconda3\Lib\site-packages\diffusers\models\attention_processor.py.
Training was then run with diffusers/examples/dreambooth/train_dreambooth_lora.py.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinearLayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, network_alpha=None):
        super().__init__()
        ...
        # Four bias-free layers: in_features -> 100 -> rank -> 50 -> out_features
        self.down_aux = nn.Linear(in_features, 100, bias=False)
        self.down = nn.Linear(100, rank, bias=False)
        self.up = nn.Linear(rank, 50, bias=False)
        self.up_aux = nn.Linear(50, out_features, bias=False)

        self.network_alpha = network_alpha
        self.rank = rank

        # Fixed seed so the frozen aux matrices can be regenerated later
        seed = 0
        torch.manual_seed(seed)
        nn.init.orthogonal_(self.down_aux.weight, gain=1)
        nn.init.orthogonal_(self.down.weight, gain=1)
        nn.init.zeros_(self.up.weight)
        nn.init.orthogonal_(self.up_aux.weight, gain=1)

        # Freeze the aux layers; only down and up are trained
        self.down_aux.weight.requires_grad=False
        self.up_aux.weight.requires_grad=False

    def forward(self, hidden_states):
        orig_dtype = hidden_states.dtype
        dtype = self.down.weight.dtype

        # Merge each trained matrix with its aux matrix, then apply them as two linear maps
        down_hidden_states = F.linear(hidden_states.to(dtype), self.down.weight @ self.down_aux.weight)
        up_hidden_states = F.linear(down_hidden_states, self.up_aux.weight @ self.up.weight)

        if self.network_alpha is not None:
            up_hidden_states *= self.network_alpha / self.rank
        return up_hidden_states.to(orig_dtype)

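load_lora_weights does not work well with these weights, so I convert the Light LoRA into conventional LoRA weights as shown below and load those into the model.

Four kinds of weights are saved: lora.down_aux.weight, lora.up_aux.weight, lora.down.weight, and lora.up.weight. At generation time, matrix products of these give the conventional lora.down.weight and lora.up.weight. The down_aux and up_aux weights are not actually needed, but I save them for verification.

To drop the unneeded entries,
state_dict = {module_name: param for module_name, param in state_dict.items() if 'lora.up.weight' in module_name or 'lora.down.weight' in module_name}
removes everything from state_dict except lora.down.weight and lora.up.weight.

Finally, pipe.unet.load_attn_procs(state_dict) should apply the LoRA weights.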

import torch
import torch.nn as nn
from diffusers import DDIMScheduler, StableDiffusionPipeline

model_id = "./cat_1_TI/"
lora_path = './cat_1_TI_Li/pytorch_lora_weights.bin'
ddim = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id,torch_dtype=torch.float16,scheduler=ddim).to("cuda")

# Load the Light LoRA weights saved by training
state_dict = torch.load(lora_path, map_location="cpu")

for name in list(state_dict.keys()):
    if 'unet' in name:
        if 'down_blocks.0' in name:
            hidden_dim = 320
        if 'down_blocks.1' in name:
            hidden_dim = 640
        if 'down_blocks.2' in name:
            hidden_dim = 1280
        if 'up_blocks.1' in name:
            hidden_dim = 1280
        if 'up_blocks.2' in name:
            hidden_dim = 640
        if 'up_blocks.3' in name:
            hidden_dim = 320
        if 'mid_block' in name:
            hidden_dim = 1280

        if 'attn2.processor' in name:
            # Cross-Attention
            if 'to_k' in name or 'to_v' in name:
                input_dim = 768
            else:
                input_dim = hidden_dim
        else:
            # Self-Attention
            input_dim = hidden_dim

        output_dim = hidden_dim
        
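        # Regenerate the frozen aux matrices from the same seed used in training.
        # nn.Embedding weights are (num, dim), so .T matches the nn.Linear (out, in) layout.
        # The values only match training if the random draws occur in the same order,
        # which is why the prints below compare them against the saved copies.
        embedding1 = nn.Embedding(input_dim, 100)
        embedding2 = nn.Embedding(50, output_dim)
        
        seed = 0
        torch.manual_seed(seed)
        nn.init.orthogonal_(embedding1.weight.T, gain=1)
        nn.init.orthogonal_(embedding2.weight.T, gain=1)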
        
        if 'lora.down_aux.weight' in name:
            print(embedding1.weight.T)
            print(state_dict[name])
        if 'lora.up_aux.weight' in name:
            print(embedding2.weight.T)
            print(state_dict[name])

        if 'lora.down.weight' in name:
            state_dict[name] = state_dict[name] @ embedding1.weight.T
            print(name, state_dict[name].shape)
        if 'lora.up.weight' in name:
            state_dict[name] = embedding2.weight.T @ state_dict[name]
            print(name, state_dict[name].shape)

state_dict = {module_name: param for module_name, param in state_dict.items() if 'lora.up.weight' in module_name or 'lora.down.weight' in module_name}

pipe.unet.load_attn_procs(state_dict)
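
The conversion above relies on the fact that two stacked bias-free linear layers collapse into one by multiplying their weights. The following dependency-free sketch on toy sizes checks that applying the four Light LoRA layers in sequence gives the same output as the two merged LoRA layers. The helpers `matmul`, `transpose`, `linear`, `rand_matrix` and the tiny dimensions are illustrative only and are not part of diffusers.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def linear(x, weight):
    """Bias-free linear layer with a PyTorch-style (out, in) weight: x @ weight.T."""
    return matmul(x, transpose(weight))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
in_dim, aux_down, rank, aux_up, out_dim = 6, 5, 2, 4, 3  # toy sizes

down_aux = rand_matrix(aux_down, in_dim, rng)   # frozen
down = rand_matrix(rank, aux_down, rng)         # trained
up = rand_matrix(aux_up, rank, rng)             # trained
up_aux = rand_matrix(out_dim, aux_up, rng)      # frozen
x = rand_matrix(2, in_dim, rng)                 # batch of 2 inputs

# Four layers applied one after another
y_four = linear(linear(linear(linear(x, down_aux), down), up), up_aux)

# Merged into conventional LoRA weights, as in the conversion script
lora_down = matmul(down, down_aux)   # (rank, in_dim)
lora_up = matmul(up_aux, up)         # (out_dim, rank)
y_two = linear(linear(x, lora_down), lora_up)

max_diff = max(abs(a - b) for ra, rb in zip(y_four, y_two) for a, b in zip(ra, rb))
print("max difference:", max_diff)
```

The two outputs agree up to floating-point rounding, which is why a trained Light LoRA can be loaded as an ordinary rank-4 LoRA.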

2. Initialize only down_lora with a zero matrix

Here only down_lora is initialized with a zero matrix.

class LoRALinearLayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, network_alpha=None):
        super().__init__()
...

        nn.init.orthogonal_(self.down_aux.weight, gain=1)
        nn.init.zeros_(self.down.weight)
        nn.init.orthogonal_(self.up.weight, gain=1)
        nn.init.orthogonal_(self.up_aux.weight, gain=1)

3. Initialize both with zero matrices

Here both up_lora and down_lora are initialized with zero matrices.

class LoRALinearLayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, network_alpha=None):
        super().__init__()
...

        nn.init.orthogonal_(self.down_aux.weight, gain=1)
        nn.init.zeros_(self.down.weight)
        nn.init.zeros_(self.up.weight)
        nn.init.orthogonal_(self.up_aux.weight, gain=1)
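
A plausible reason this variant cannot train, offered here as an assumption rather than a verified cause: the update is a product of the two trained factors, so the gradient with respect to each factor is proportional to the other factor. If both start at zero, both gradients are zero and gradient descent never moves them. A scalar sketch, with hypothetical names `u` (up), `d` (down), input `x` and target `t`:

```python
def grads(u, d, x, t):
    """Gradients of the loss (u*d*x - t)**2 with respect to u and d."""
    err = u * d * x - t
    return 2 * err * d * x, 2 * err * u * x


def train(u, d, x=1.0, t=0.5, lr=0.1, steps=200):
    """Plain gradient descent on the two factors."""
    for _ in range(steps):
        gu, gd = grads(u, d, x, t)
        u, d = u - lr * gu, d - lr * gd
    return u, d


if __name__ == "__main__":
    print("both zero :", train(0.0, 0.0))   # neither factor ever moves
    print("up zero   :", train(0.0, 1.0))   # u*d moves toward the target
```

When only one factor is zero, the other supplies a non-zero gradient, which matches variants 1 and 2 being trainable.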

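4. Train a merged matrix and split it with singular value decomposition

Treat down_lora and up_lora as a single $(100,50)$ matrix and keep the aux matrices fixed. At image generation time, apply singular value decomposition to it, truncate $U\Gamma V^T$ to $(100,4), (4,4), (4,50)$, split that into $(100,4)$ and $(4,50)$, and multiply by the aux matrices to get back down_lora and up_lora.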
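
Written out in the (input, output) shape convention of the text, the split looks as follows. The symbols $A_{\text{down}}$ and $A_{\text{up}}$ for the frozen aux matrices and $M$ for the merged trained matrix are labels assumed here. Which side absorbs $\Gamma$ is a free choice; the code in this section puts it on the up side. The truncation is exact only if $M$ has rank 4 or less, otherwise it is an approximation.

```latex
\Delta W = A_{\text{down}}\, M\, A_{\text{up}}, \qquad
A_{\text{down}} \in \mathbb{R}^{d_{\text{in}} \times 100},\;
M \in \mathbb{R}^{100 \times 50},\;
A_{\text{up}} \in \mathbb{R}^{50 \times d_{\text{out}}}

M \approx U_4 \Gamma_4 V_4^{T}, \qquad
U_4 \in \mathbb{R}^{100 \times 4},\;
\Gamma_4 \in \mathbb{R}^{4 \times 4},\;
V_4^{T} \in \mathbb{R}^{4 \times 50}

\text{down\_lora} = A_{\text{down}} U_4 \in \mathbb{R}^{d_{\text{in}} \times 4}, \qquad
\text{up\_lora} = \Gamma_4 V_4^{T} A_{\text{up}} \in \mathbb{R}^{4 \times d_{\text{out}}}
```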

class LoRALinearLayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, network_alpha=None):
        super().__init__()
...
        self.down_aux = nn.Linear(in_features, 100, bias=False)
        self.down_up = nn.Linear(100, 50, bias=False)
        self.up_aux = nn.Linear(50, out_features, bias=False)
...
        nn.init.orthogonal_(self.down_aux.weight, gain=1)
        nn.init.zeros_(self.down_up.weight)
        nn.init.orthogonal_(self.up_aux.weight, gain=1)

        self.down_aux.weight.requires_grad=False
        self.up_aux.weight.requires_grad=False
...
    def forward(self, hidden_states):
        orig_dtype = hidden_states.dtype
        dtype = self.down_up.weight.dtype

        x = self.down_aux(hidden_states.to(dtype))
        x = self.down_up(x)
        up_hidden_states = self.up_aux(x)
...

state_dict = torch.load(lora_path, map_location="cpu")
for name in list(state_dict.keys()):
...
        if 'lora.down_up.weight' in name:
            M = state_dict[name]
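            # Full SVD of the learned (50, 100) weight, then keep the top `rank` components
            U, S, Vh = torch.linalg.svd(M.to(torch.float32))
            rank = 4

            U = U[:, :rank]
            S = S[:rank]
            Vh = Vh[:rank, :]

            # Fold the singular values into the up side
            U = U @ torch.diag(S)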

            state_dict[name.replace('lora.down_up.weight', 'lora.down.weight')] = Vh @ embedding1.weight.T
            state_dict[name.replace('lora.down_up.weight', 'lora.up.weight')] = embedding2.weight.T @ U

5. Initialize with normal

When normal was used in place of orthogonal, the loss did not decrease and training did not go well.

class LoRALinearLayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, network_alpha=None):
        super().__init__()
...

        nn.init.normal_(self.down_aux.weight, std=1)
        nn.init.normal_(self.down.weight, std=1 / rank)
        nn.init.zeros_(self.up.weight)
        nn.init.normal_(self.up_aux.weight, std=1)
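
One possible contributing factor, offered as an assumption rather than a diagnosis: a matrix with orthonormal rows or columns never increases the norm of its input, whereas a matrix of independent N(0, 1) entries multiplies the squared norm by roughly its number of rows on average. With std=1 the frozen aux layers would therefore scale the LoRA branch up by a large factor. A dependency-free Monte Carlo sketch of the second statement (the function name and toy sizes are illustrative):

```python
import random


def mean_sq_gain_normal(rows, cols, trials=200, seed=0):
    """Average of ||W x||^2 / ||x||^2 for W with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        x_sq = sum(v * v for v in x)
        y_sq = 0.0
        for _ in range(rows):
            w_row = [rng.gauss(0.0, 1.0) for _ in range(cols)]
            y_i = sum(w * v for w, v in zip(w_row, x))
            y_sq += y_i * y_i
        total += y_sq / x_sq
    return total / trials


if __name__ == "__main__":
    # Toy sizes; the average gain should come out close to `rows`
    print(mean_sq_gain_normal(rows=20, cols=30))
```

By the same reasoning, a $(100, 768)$ normal down_aux has a squared-norm gain near 100 and a normal up_aux with 320 outputs a gain near 320, while their orthogonal counterparts have a gain of at most 1.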

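Results

Starting from the TI-trained model, the results of training Light LoRA with methods 1 to 4 are shown below.
The learning rate was 2.0e-03 and the number of training steps was 500.
With a large learning rate, fp16 training sometimes failed, so from that point on I used --mixed_precision="no".

With 1. (only up_lora zero-initialized) and 2. (only down_lora zero-initialized), Light LoRA appears to train. In contrast, with 3. (both zero-initialized), nothing was learned.
With 4., the weights reduced in rank by singular value decomposition after training also appear to have trained well.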

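0. TI training only (initial state)

0. TI + conventional LoRA

1. TI + LiLoRA (only up_lora initialized with a zero matrix)

2. TI + LiLoRA (only down_lora initialized with a zero matrix)

3. TI + LiLoRA (both initialized with zero matrices)

This is no different from the TI-only result in 0.

4. TI + LiLoRA (split with singular value decomposition)

Conventional LoRA appears to reproduce the subject most faithfully. Personally, I think the singular value decomposition split looks next best.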

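Size of LiLoRA

The table below shows the LoRA output file size at rank=4.
If down_aux and up_aux are saved along with the model, the LoRA file is 59MB, which is actually larger than a conventional LoRA.
In case 4., where down_up is saved as a $(100,50)$ matrix, it was 61MB.
Dropping down_aux and up_aux shrinks the file to 0.4MB, which puts it in the same size range as an XTI model.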

Model weights	Size
TI (1 token)	4 KB
TI (8 tokens)	25 KB
XTI (8*16 tokens)	400 KB?
Conventional LoRA	3,211 KB
LiLoRA (down+up)	379 KB
LiLoRA (down+up+down_aux+up_aux)	58,844 KB
LiLoRA (down_up+down_aux+up_aux)	60,995 KB
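
The sizes in the table can be roughly cross-checked by counting parameters. The sketch below assumes the Stable Diffusion 1.x UNet layout (16 transformer blocks: 5 at width 320, 5 at width 640, 6 at width 1280, each with a self-attention and a cross-attention module carrying four LoRA layers), float32 storage, and no file overhead. These are assumptions made for the estimate, not figures from the measurements above.

```python
# Assumed SD 1.x layout: number of transformer blocks per hidden width
BLOCKS = {320: 5, 640: 5, 1280: 6}
TEXT_DIM = 768
BYTES_PER_PARAM = 4  # float32


def layer_dims(hidden):
    """(in, out) of the 8 LoRA layers in one transformer block."""
    self_attn = [(hidden, hidden)] * 4                       # q, k, v, out
    cross_attn = [(hidden, hidden), (TEXT_DIM, hidden),
                  (TEXT_DIM, hidden), (hidden, hidden)]      # q, k, v, out
    return self_attn + cross_attn


def total_params(per_layer):
    """Sum per_layer(in_dim, out_dim) over every LoRA layer in the UNet."""
    return sum(n * sum(per_layer(i, o) for i, o in layer_dims(h))
               for h, n in BLOCKS.items())


def kib(params):
    return params * BYTES_PER_PARAM / 1024


RANK = 4
lora = total_params(lambda i, o: i * RANK + RANK * o)
light = total_params(lambda i, o: 100 * RANK + RANK * 50)
aux = total_params(lambda i, o: i * 100 + 50 * o)
down_up = total_params(lambda i, o: 100 * 50)

if __name__ == "__main__":
    print(f"LoRA                 : {lora} params, {kib(lora):.0f} KiB")
    print(f"LiLoRA (down+up)     : {light} params, {kib(light):.0f} KiB")
    print(f"LiLoRA (+aux)        : {light + aux} params, {kib(light + aux):.0f} KiB")
    print(f"LiLoRA (down_up+aux) : {down_up + aux} params, {kib(down_up + aux):.0f} KiB")
```

Under these assumptions the estimates come to about 3,114, 300, 58,650, and 60,850 KiB, each somewhat below the measured file size, which would be consistent with some per-tensor overhead in the saved file. The trainable parameter count of Light LoRA (76,800) is a little under 1/10 of the conventional LoRA count (797,184).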

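Summary

I tried out Light LoRA (Lightweight DreamBooth).
As far as my own experiments go, initializing both matrices to zero did not work, for reasons I could not determine. Orthogonal initialization also turned out to be more important than I expected.
Light LoRA is only one component of HyperDreamBooth, and I do not understand the rest of it well. Also, in the Stable Diffusion context "Hypernetwork" refers to a different structure, so I think the name is an unfortunate choice.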
