Image generation examples with Stable Diffusion 3

Posted at 2024-06-14

For the SDXL and DALL-E3 results, see the links below.

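The images were created with ComfyUI.
① Download ComfyUI from the "Direct link to download" on its GitHub page and extract it.
② Download sd3_medium_incl_clips_t5xxlfp8.safetensors from the Stable Diffusion 3 page on Hugging Face and place it in ComfyUI_windows_portable\ComfyUI\models\checkpoints inside the folder extracted above.
③ Run ComfyUI_windows_portable\run_nvidia_gpu.bat. The first launch takes a while.
④ Install ComfyUI Manager (I do not fully understand this part).
⑤ Load the SD3 workflow comfy_example_workflows_sd3_medium_example_workflow_basic.json with Load in ComfyUI.
⑥ Delete TripleCLIPLoader and connect the CLIP output of Load Checkpoint to CLIP Text Encode. This works because the checkpoint file, as its name indicates, already includes the CLIP and T5-XXL text encoders.
To generate several images at once, add Add Node > latent > batch > Repeat Latent Batch, insert it in the latent image connection feeding the sampler, and set the seed to randomize.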
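
Step ② is the easiest one to get wrong, because ComfyUI only lists checkpoints found in its model folder. The following is a minimal sketch that checks the placement before launching. The file name and the portable folder name come from the steps above; the helper names are hypothetical, and the folder is assumed to be named `checkpoints` as in a standard ComfyUI install.

```python
from pathlib import Path

# File name taken from step 2; the folder layout is an assumption
# based on a standard ComfyUI portable install.
CHECKPOINT_NAME = "sd3_medium_incl_clips_t5xxlfp8.safetensors"
CHECKPOINT_SUBDIR = ("ComfyUI", "models", "checkpoints")


def checkpoint_path(portable_root):
    """Return the location where the SD3 checkpoint is expected."""
    return Path(portable_root).joinpath(*CHECKPOINT_SUBDIR, CHECKPOINT_NAME)


def checkpoint_installed(portable_root):
    """Return True if the checkpoint file exists at the expected location."""
    return checkpoint_path(portable_root).is_file()


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains the extracted folder.
    root = "ComfyUI_windows_portable"
    status = "found" if checkpoint_installed(root) else "missing"
    print(f"{checkpoint_path(root)}: {status}")
```

If the script reports the file as missing, the checkpoint will most likely not appear in the Load Checkpoint node either.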

Generation takes about 25 minutes per 8 images on a GTX 1070 Ti (8 GB VRAM).
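
To put that figure in per-image terms, here is a rough back-of-the-envelope calculation. It assumes the 8 images were generated with the default 28 sampling steps mentioned later in this post, and that all of the time goes into sampling (model loading, text encoding, and VAE decoding are ignored), so the per-step number is an upper-bound estimate rather than a measurement.

```python
# Figures from the post: 8 images in about 25 minutes, 28 sampling steps per image.
TOTAL_MINUTES = 25
NUM_IMAGES = 8
STEPS_PER_IMAGE = 28  # assumption: the default step count was used for these runs

seconds_per_image = TOTAL_MINUTES * 60 / NUM_IMAGES
seconds_per_step = seconds_per_image / STEPS_PER_IMAGE

print(f"{seconds_per_image:.1f} s per image")  # 187.5 s per image
print(f"{seconds_per_step:.1f} s per step")    # 6.7 s per step
```

That is roughly three minutes per image on this GPU.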

Results

yellow book and a red vase.

With SDXL, red books and yellow vases showed up, but that problem is resolved here.

A photo of a dog wearing a blue shirt and a cat wearing a red shirt sitting in a park, photorealistic dslr.

With SDXL the dog ended up wearing the red shirt about half the time, but that problem is resolved here.

There are two Chinese teapots on a table. One pot has a painting of a dragon, while the other pot has a painting of a panda.

SDXL did not draw the panda teapot.

A stack of 3 books. A green book is on the top, sitting on a red book. The red book is in the middle sitting on a blue book. The blue book is on the bottom.

The relative positions are partly understood, but the number of books is wrong. DALL-E3 at least managed to draw 3 to 4 books.

A photo of two squirrel warriors dressed as knights fighting on a battlefield. The squirrel on the left holds a stick, while the squirrel on the right holds a long sword. Gray clouds.

The assignment of a stick on the left and a sword on the right does not work well.

An illustration of avocado sitting in therapist's chair, saying 'I just feel so empty inside' with a pitsized hole in its center. The therapist, a spoon, scribbles notes.

This prompt is taken from DALL-E3. The avocado has no hole in it.

A photo of two monkeys sitting on a tree. They are holding a wooden board that says "Best friends", 4K dslr.

The monkeys hold the board more often than with SDXL.

A photo of a cute corgi wearing a beret holding a sign that says "Diffusion Models". There is Eiffel tower in the background.

With SDXL the corgi was rarely wearing the beret. Subjectively, though, this corgi is not cute.

A horse riding an astronaut.

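It fails to draw "a horse riding an astronaut".
The color saturation also looks somewhat off, although I left the default settings (dpm++2m, 28 steps, cfg 4.5) unchanged.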
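
For readers unfamiliar with the cfg setting: it is the classifier-free guidance scale. The sketch below shows the standard way such a scale combines the unconditional and prompt-conditioned predictions at each sampling step. It is an illustration with a hypothetical function name and toy numbers, not ComfyUI's actual code.

```python
def apply_cfg(uncond, cond, scale):
    """Combine predictions as uncond + scale * (cond - uncond), element by element."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy values standing in for model outputs (not real data).
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

print(apply_cfg(uncond, cond, 4.5))  # [4.5, 1.0, 3.5]
print(apply_cfg(uncond, cond, 1.0))  # [1.0, 1.0, 0.0], i.e. no extra guidance
```

A scale above 1 pushes the result past the conditioned prediction, so changing cfg is one of the settings that could be varied when checking whether the saturation changes.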

An origami of a monkey dressed as a monk riding a bike on a mountain.

The monkey riding the bike looks flat. SDXL rendered it with more depth.

A photo of a plate at a restaurant table with spaghetti and red sauce. There is sushi on top of the spaghetti. The dish is garnished with mint leaves. On the side, there is a glass with a purple drink, photorealistic, dslr.

Something sushi-like is on top, but it is unconvincing. The DALL-E3 result looks more like sushi.

an illustration of a baby daikon radish in a tutu walking a dog

Unlike SDXL, it manages to depict walking the dog. On the other hand, the daikon radish, which DALL-E3 could draw, is not drawn.

a close up of a handpalm with leaves growing from it

The hands still look unnatural.

Illustration of a doctor, whose lower body is a centaur, holding a pizza and a fire extinguisher. He has a Cerberus and a Harpy.

I tested how complex a composition it can draw.

Apples, gorillas, trumpets, and pineapples holding hands with each other in a circle.

They are not holding hands. With DALL-E3, the apple, gorilla, and pineapple formed a circle holding hands.

Four-panel cartoon. In the first panel, Apple says "Hello world". In the second panel, a banana says, "You're welcome". In the third panel, a grape says, "Love and peace". In the fourth panel, a crow says, "Let's eat".

It attempts a four-panel cartoon, but the result is unconvincing. It is worse than DALL-E3.

Four-panel cartoon. In the first panel, A hungry fox walks through the woods. In the second panel, the fox found the grapes. In the third panel, the fox jumps up hard to eat the grapes, but he can't reach them. In the fourth panel, the fox says, "This grape must be sour".

It attempts a four-panel cartoon, but the result is unconvincing. It is worse than DALL-E3.

Natural history-style illustration of a monster with a lion's head, goat's body, and snake tail.

It is not drawn in a natural-history style. DALL-E3 handled that style better.

an illustration of skeletal specimen of a centaur.

It cannot draw a skeletal specimen of a centaur.

The raccoon dog transforms into a Japanese tea kettle and performs a tightrope walking trick.

It does not look like tightrope walking. With DALL-E3 it was walking the tightrope.

Summary:

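Thanks to the added T5-XXL, assigning colors to two separate objects, which SDXL handled poorly, now works well. The same held for Imagen and eDiff-I, which also added the T5 language model, so this is most likely an effect of adding T5.
T5-XXL has 11B parameters in total (roughly 5B of that in the encoder), whereas Llama2 comes in 7B/13B/70B and ChatGPT3.5 is commonly cited as 175B. T5-XXL (the non-Flan version) lags behind Llama2 in capability, although that gap may simply reflect tuning toward the answers humans prefer.
In any case, correctly understanding an ambiguous phrase such as 「頭が赤い魚を食べる猫」 (roughly "a cat eating a fish with a red head") requires a large language model, and with open LLM development so active lately, the T5 model also feels dated. There is also SigLIP as a successor to CLIP.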

On the other hand, several outputs are inferior to those of DALL-E3, so SD3 medium has not yet caught up with DALL-E3.
