SDXL 1.0 image generation examples

Posted at 2023-08-18

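I try generating images with Stable Diffusion XL 1.0 using example prompts taken from other image generation models.
The prompts tested here are the examples given in the eDiff-I, Imagen, Parti, and Muse papers.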

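Generation code

See the reference above.
However, in my environment the GPU has only 8 GB of memory, which is not enough and causes an error, so the following change is needed.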

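# pipe.to("cuda")  # disabled: runs out of memory on an 8 GB GPU
pipe.enable_model_cpu_offload()  # move each submodel to the GPU only while it is in use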
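
For reference, here is a minimal sketch of what the generation code might look like with the Hugging Face diffusers library. The model ID, the fp16 settings, the per-seed loop, and the helper names `build_call_kwargs` and `generate` are assumptions for illustration and are not taken from the original code; only `enable_model_cpu_offload()` and the 50 inference steps come from this article.

```python
def build_call_kwargs(prompt, steps=50):
    """Arguments passed to the pipeline; everything else stays at its default."""
    return {"prompt": prompt, "num_inference_steps": steps}


def generate(prompt, n_images=4, steps=50,
             model_id="stabilityai/stable-diffusion-xl-base-1.0"):
    """Generate n_images with the SDXL base model only (no Refiner)."""
    # Imported lazily so this file can be loaded without a GPU environment.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
    )
    # pipe.to("cuda") runs out of memory on an 8 GB GPU, so offload instead.
    pipe.enable_model_cpu_offload()

    images = []
    for seed in range(n_images):
        # One fixed seed per image so that a run can be repeated (assumed setup).
        generator = torch.Generator("cpu").manual_seed(seed)
        result = pipe(generator=generator, **build_call_kwargs(prompt, steps))
        images.append(result.images[0])
    return images
```

Calling `generate('A horse riding an astronaut.')` would return a list of PIL images, one per seed. It needs the `diffusers`, `torch`, and `accelerate` packages and downloads the model weights on first use.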

eDiff-I prompts

The results shown are from SDXL. Base model only, no Refiner. infer_step=50. Everything other than the input prompt is left at its default value.

'A photo of a raccoon wearing a brown sports jacket and a hat. He is holding a whip in his hand'

Mostly drawn correctly. The shape of the whip is a bit off, but there is no major problem.

'A photo of a red parrot, a blue parrot and a green parrot singing at a concert in front of a microphone. Colorful lights in the background.'

The three parrots (red, blue, and green) are drawn correctly in some of the outputs, but not all.

'A photo of a cute corgi wearing a beret holding a sign that says "Diffusion Models". There is Eiffel tower in the background.'

The beret rarely appears, and the corgi rarely holds the sign. "Diffusion Models" is also often misspelled.

'A photo of a lion and a panda competing in the Olympics swimming event.'

The panda and the lion occasionally get blended together, but in most outputs they are drawn distinctly.

eDiff-I output

'A photo of two cute teddy bears sitting on top of a grizzly bear in a beautiful forest. Highly detailed fantasy art, 4k, artstation'

The model does not understand "sitting on top of", so the teddy bears are placed in the wrong position.

'There are two Chinese teapots on a table. One pot has a painting of a dragon, while the other pot has a painting of a panda.'

A dragon is painted on a teapot, but a panda rarely is.

'A photo of two squirrel warriors dressed as knights fighting on a battlefield. The squirrel on the left holds a stick, while the squirrel on the right holds a long sword. Gray clouds.'

Armored squirrels are drawn, but the weapon assignment (stick on the left, sword on the right) is not followed.

'A photo of a dog wearing a blue shirt and a cat wearing a red shirt sitting in a park, photorealistic dslr.'

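The shirt colors sometimes follow the prompt (dog in blue) and sometimes fail (dog in red). Since the success rate is about one half, the correct outputs may be mere coincidence.
Incidentally, dslr stands for digital single-lens reflex camera.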
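
The "mere coincidence" reading of the shirt colors corresponds to the model ignoring which animal gets which color, so that each image is right with probability 1/2. Below is a minimal sketch of a two-sided binomial test for that hypothesis. The counts are hypothetical placeholders, since this article does not record exact numbers, and the function names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(k, n, p=0.5):
    """Sum the probabilities of all outcomes no more likely than the observed k."""
    observed = binom_pmf(k, n, p)
    return sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * (1 + 1e-9)
    )


# Hypothetical counts for illustration only (not measured in this article).
n_images, n_correct = 10, 5
p_value = two_sided_p_value(n_correct, n_images)
print(f"{n_correct}/{n_images} correct, p-value vs. coin flip: {p_value:.3f}")
```

A large p-value means the observed success rate is consistent with a coin flip, i.e. with the color binding being ignored. A small one would suggest that the prompt does influence which animal gets which color.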

eDiff-I output

'A photo of two monkeys sitting on a tree. They are holding a wooden board that says "Best friends", 4K dslr.'

The monkeys rarely hold a board. There is just one image in which they hold a board that reads something like "BEST".

'A photo of a golden retriever puppy wearing a green shirt. The shirt has text that says "NVIDIA rocks". Background office. 4k dslr.'

"NVIDIA rocks" is rendered fairly well.

'An ice sculpture is made with the text "Happy Holidays". Christmas decorations in the bacground. Dslr photo.'

"Happy Holidays" is rendered quite well.

'A dslr photo of a colorful Graffiti on a wall with the text "Peace love". There is a painting of a bull dog with sunglasses next to it.'

The bulldog with sunglasses is drawn, but "Peace love" does not appear.

eDiff-I output

'A 4K dslr photo of a hedgehog sitting in a small boat in the middle of a pond. It is wearing a Hawaiian shirt and a straw hat. It is reading a book. There are a few leaves in the background.'

Reading a book, the Hawaiian shirt, and the straw hat each appear only in some of the outputs.

'A fantasy landscape on an alien planet in which there are many buildings. There is a beautiful bridge with a pond in the center. There is one large moon in the sky. The sky is orange. Digital art, artstation.'

I do not see any notable difference.

'A close-up 4k dslr photo of a cat riding a scooter. It is wearing a plain shirt and has a bandana around its neck. It is wearing a scooter helmet. There are palm trees in the background.'

Few outputs satisfy all of the conditions: the scooter, the bandana around the neck, the helmet on the head, and the shirt on the body.

'A photo of a plate at a restaurant table with spaghetti and red sauce. There is sushi on top of the spaghetti. The dish is garnished with mint leaves. On the side, there is a glass with a purple drink, photorealistic, dslr.'

The sushi is not on top of the spaghetti.

eDiff-I output

'An origami of a monkey dressed as a monk riding a bike on a mountain.'

The "origami monkey dressed as a monk" is drawn as a three-dimensional figure rather than a flat one.

'A 4k dslr photo of two teddy bears wearing a sports jersey with the text "eDiffi" written on it. They are on a soccer field.'

"eDiffi" is often misspelled, but the other features are fine. To be picky, the field looks more like a rugby field than a soccer field.

'A photo of a lemur wearing a red magician hat and a blue coat in a garden. The lemur is performing a magic trick with cards.'

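The combination of the red hat and the blue coat is drawn correctly with high probability.

eDiff-I output

Imagen prompts

From here on I do not reproduce the outputs from the papers, so please refer to the original papers.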

'Hovering cow abducting aliens.'

No problems.

'Greek statue of a man tripping over a cat.'

No problems.

'yellow book and a red vase.'

The color assignment (yellow book, red vase) is not followed. Imagen can do it.

'A black apple and a green backpack.'

The color assignment (black apple, green backpack) is not followed. Imagen can do it.

'A horse riding an astronaut.'

A trick prompt. SDXL cannot draw "a horse riding an astronaut".

'A panda making latte art.'

No problems.
DALL-E2 draws "latte art of a panda" instead.

'A couple of glasses are sitting on a table.'

No problems.
DALL-E2 draws "eyeglasses" instead.

'A cube made of brick. A cube with the texture of brick.'

The cube made of brick is not drawn well.

'New York Skyline with Hello World written with fireworks on the sky.'

Compared with the Imagen output, the accuracy of the text is mediocre.

'A storefront with Text to Image written on it.'

"Text to Image" is not written correctly on the sign.

'A storefront with "Text to Image" written on it.'

Enclosing "Text to Image" in double quotes (") improved the result.
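
To make the two prompt variants easy to compare, the quoting can be wrapped in a small helper. This is only an illustrative sketch; `storefront_prompt` is a hypothetical name, and the template is simply the Imagen prompt used above.

```python
def storefront_prompt(text, quoted=True):
    """Build the storefront prompt, optionally wrapping the text in double quotes."""
    shown = f'"{text}"' if quoted else text
    return f"A storefront with {shown} written on it."


# Print the unquoted and quoted variants side by side.
for quoted in (False, True):
    print(storefront_prompt("Text to Image", quoted=quoted))
```

Feeding both variants to the pipeline with the same seeds would isolate the effect of the quotes from random variation.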

Parti prompts

'A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!'

If you pick among the outputs, some of them render "Welcome Friends" correctly.

'A green sign that says "Very Deep Learning" and is at the edge of the Grand Canyon. Puffy white clouds are in the sky.'

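"Very Deep Learning" has many misspellings. Perhaps the prompt is meant to induce confusion with "valley".
On average the results seem worse than the outputs on page 12 of the SDXL paper.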

'A blue Porsche 356 parked in front of a yellow brick wall.'

I am not familiar with the Porsche 356, so I cannot tell whether it is correct.

'A photo of an astronaut riding a horse in the forest. There is a river in front of them with water lilies.'

No problems.

'A map of the United States made out of sushi. It is on a table next to a glass of red wine.'

No problems.

Muse prompts

'A high contrast portrait of very happy fuzzy panda dressed as a chef in a high end kitchen marking dough. There is painting of flowers on the wall behind him.'

No problems, but the images feel cluttered.

'Rainbow coloured penguin.'

No problems. Apparently "color" is American English and "colour" is British English.

'A stack of 3 books. A green book is on the top, sitting on a red book. The red book is in the middle sitting on a blue book. The blue book is on the bottom.'

Specifying positions is difficult.

'A real flamingo reading a large open book. a big stack of is piled up next to it. dslr photograph'

No problems.

'A storefront with "Google Research Cafe" written on it.'

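Similar to the earlier 'A storefront with Text to Image written on it.' The color scheme looks Google-like.

Others:

I also tried a few other prompts.
I did not have the capacity to try PartiPrompts, the benchmark used in the paper.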

'an armchair in the shape of an avocado'

No problems.

'an illustration of a baby daikon radish in a tutu walking a dog'

The "daikon radish" is not drawn.

'a hedgehog using a calculator'

The calculator is slightly distorted.

'a corgi wearing a red bowtie and a purple party hat'

The "red bowtie" is not drawn.

'a stained glass window of a panda eating bamboo'

No problems.

'a close up of a handpalm with leaves growing from it'

The structure of the hand is not reproduced.

'A teddybear on a skateboard in Times Square.'

No problems.

Summary

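・Detailed clothing specifications for a single object work fine up to about two or three items.
・Two different objects (dog and cat, lion and panda) occasionally have their features mixed.
・For two identical objects, an attribute shared by both is reflected, but attributes that differ between them are rarely reflected.
・Specifying colors and positions for multiple objects is difficult.
・When rendering text, enclosing it in double quotes ("") sometimes improves the result.
・Models that use T5-XXL, such as eDiff-I and Imagen, follow text descriptions better, but SDXL appears to be an improvement over SD1.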

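Personal impressions of SDXL

The following are my personal impressions of the SDXL model, so feel free to skip this part if you are not interested.

The U-Net architecture of SDXL apparently also differs from SD1 and SD2, but I do not know much about that.
I do not know whether the split into Base and Refiner models corresponds to the expert models in eDiff-I or to something like the SD2 upscaler. The Refiner model's U-Net appears to be slightly smaller than the Base model's.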

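Below I trace how the "training image size" and the "TextEncoder" differ between models.
・SD1-1
Training image size is 256x256. The TextEncoder is the CLIP (OpenAI) TextEncoder (dimension 768).
・SD1-4
Training image size is 512x512. The TextEncoder is the CLIP (OpenAI) TextEncoder (dimension 768).
・Novel AI (NAI, a derivative of the SD1-4 line)
Training image size is 768x768 + bucket. The TextEncoder is the CLIP (OpenAI) TextEncoder (dimension 768).
・SD2
Training image sizes are 512x512 and 768x768. The TextEncoder is the OpenCLIP (LAION) TextEncoder (dimension 1024).
・SDXL
Training image size is 1024x1024 + bucket. The TextEncoders are the CLIP (OpenAI) TextEncoder (dimension 768) plus the OpenCLIP (LAION) TextEncoder (dimension 1280), for a total dimension of 2048.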
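
The total of 2048 for SDXL comes from concatenating the two encoder outputs along the feature axis. As a small sketch of that arithmetic, using only the dimensions listed above (the table and the `conditioning_dim` helper are illustrative names, not part of any library):

```python
# Text encoder hidden sizes as listed above: (encoder name, dimension).
TEXT_ENCODERS = {
    "SD1": [("CLIP (OpenAI)", 768)],
    "SD2": [("OpenCLIP (LAION)", 1024)],
    "SDXL": [("CLIP (OpenAI)", 768), ("OpenCLIP (LAION)", 1280)],
}


def conditioning_dim(model):
    """Total per-token width when the encoder outputs are concatenated."""
    return sum(dim for _, dim in TEXT_ENCODERS[model])


for name in TEXT_ENCODERS:
    print(name, conditioning_dim(name))
```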

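CLIP, developed by OpenAI, and OpenCLIP, developed by LAION, are different things.
There is no relationship between the training image size and the TextEncoder dimension.
Bucket training (training on images with different aspect ratios) was actually proposed fairly early on (by NAI). Applying it makes it less likely that the head gets cut off in portrait-oriented images.
With 1024x1024 + bucket, the training images range from 512x2048 to 2048x512.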
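
As a rough illustration of aspect-ratio bucketing, the sketch below enumerates resolutions between 512x2048 and 2048x512 whose pixel count stays within 1024x1024, and assigns an image to the bucket with the closest aspect ratio. The step of 64 pixels and the selection rule are assumptions made for this example; this is not the actual bucket list used by NAI or SDXL.

```python
def make_buckets(target=1024, step=64, min_side=512, max_side=2048):
    """List (width, height) buckets whose pixel count stays within target*target."""
    max_pixels = target * target
    buckets = []
    for width in range(min_side, max_side + 1, step):
        # Largest height that is a multiple of step and fits the pixel budget.
        height = (max_pixels // width) // step * step
        height = min(height, max_side)
        if height >= min_side:
            buckets.append((width, height))
    return buckets


def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


buckets = make_buckets()
print(buckets[0], buckets[-1], len(buckets))
print(nearest_bucket(800, 1200, buckets))
```

Because a portrait image is resized into a portrait bucket instead of being center-cropped to a square, less of the top and bottom is lost, which matches the point above about heads being cut off less often.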

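The TextEncoders of eDiff-I, Imagen, Parti, and Muse are as follows.
・eDiff-I
CLIP (OpenAI) and T5-XXL (Google).
・Imagen
T5-XXL (Google) only.
T5-XXL is a Transformer with a dimension equivalent to 4096, but it is purely an LLM (language model) and, unlike CLIP, has no tie to images.
・Parti
The Encoder corresponds to the TextEncoder.
The dimension is 4096, but the total parameter count is 20B, which amounts to a model size of 80 GB in fp32.
Running the full model on an ordinary PC is therefore likely to be difficult.

・Muse
T5-XXL (Google) only.
Generation is faster than Parti-3B and LDM (SD).

On the other hand, in terms of FID, Muse beats SD but is worse than Imagen, Parti, and eDiff-I.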
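
The 80 GB figure for Parti is simply the parameter count multiplied by the bytes per parameter. A minimal sketch of that estimate follows; it counts weights only, with 1 GB taken as 10**9 bytes, and `weight_size_gb` is an illustrative helper name.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_gb(n_params, dtype="fp32"):
    """Size of the raw weights in GB (1 GB = 10**9 bytes), excluding activations."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"Parti 20B in {dtype}: {weight_size_gb(20e9, dtype):.0f} GB")
```

Even at half precision, the weights alone are far larger than the 8 GB GPU used in this article.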

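・eDiff-I, Imagen, and Muse all use the T5-XXL TextEncoder, so I expected SDXL to use it as well, but contrary to my expectation T5-XXL was not used.
・Partly because of the performance drop in SD2.0 caused by dataset filtering, I was skeptical of OpenCLIP itself for a while, but judging from the SDXL results it seems fine for now.
・I think a good TextEncoder is one that can decompose text into many features, as in "the features of a cat are: it walks on four legs, has cat ears, has whiskers, has fluffy fur, its fur color is ...".
Personally, I suspect that for a given CLIP accuracy, the smarter the ImageEncoder is (because it extracts features from the image and, with its remaining capacity, also learns the classification problem), the dumber the CLIP TextEncoder becomes in relative terms: it cannot decompose the words in the text into features and returns only a tautology ("the feature of a cat is that it is a cat").
The ImageEncoder of the OpenCLIP used in SDXL is ViT-bigG, but an excellent ImageEncoder does not necessarily imply an excellent TextEncoder.
・I think the strongest LLM at present is GPT-4 as used in ChatGPT (rumored to have 220B×8 parameters, a figure OpenAI has not confirmed). Will the day come when it, or an equivalent model, can be used as a TextEncoder for free?
Incidentally, GPT3 (Davinci) already has 175B parameters (dimension 12288). GPT3 (Curie) has 6.7B parameters (dimension 4096), which is not much different from T5-XXL (5.5B, dimension equivalent to 4096). As of 2023/08, I believe Llama2 (70B) is the largest publicly released LLM.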
