More than 1 year has passed since last update.

CLIPエンべディングを用いた画像による画像検索およびGPT4Vを用いた画像関連性の理由づけ

Last updated at 2024-02-23Posted at 2024-02-23

こちらをウォークスルーします。チュートリアル通りだと一部動かなかったので修正しています。

%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-vector-stores-qdrant
%pip install llama-index-embeddings-clip
%pip install llama_index ftfy regex tqdm
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install -U qdrant_client
%pip install wikipedia
dbutils.library.restartPython()

Open AI APIキーはシークレットに格納しています。

# OpenAIのセットアップ
import os
import openai

OPENAI_API_TOKEN = dbutils.secrets.get("demo-token-takaaki.yayoi", "openai_api_key")
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

Wikipediaから画像とテキストのダウンロード

Databricksランタイム14.0以降のクラスターを使用すると、ファイルをドライバーノードのローカルストレージではなく、ワークスペースファイルとして保存することができます。ファイルへのアクセスが容易になります。

import wikipedia
import urllib.request
from pathlib import Path


image_path = Path("mixed_wiki")
image_uuid = 0
# image_metadata_dict は画像のuuid、ファイル名、パスを含む画像のメタデータを格納します
image_metadata_dict = {}
MAX_IMAGES_PER_WIKI = 30

wiki_titles = [
    "Vincent van Gogh",
    "San Francisco",
    "Batman",
    "iPhone",
    "Tesla Model S",
    "BTS band",
]

# 画像のみを格納するフォルダの作成
if not image_path.exists():
    Path.mkdir(image_path)


# Wikiページの画像のダウンロード
# それぞれの画像に UUID を割り当て
for title in wiki_titles:
    images_per_wiki = 0
    print(title)
    try:
        page_py = wikipedia.page(title)
        list_img_urls = page_py.images
        for url in list_img_urls:
            if url.endswith(".jpg") or url.endswith(".png"):
                image_uuid += 1
                image_file_name = title + "_" + url.split("/")[-1]

                # 将来的には、img_pathは生の画像ファイルをポイントするS3パスでも構いません
                image_metadata_dict[image_uuid] = {
                    "filename": image_file_name,
                    "img_path": "./" + str(image_path / f"{image_uuid}.jpg"),
                }
                urllib.request.urlretrieve(
                    url, image_path / f"{image_uuid}.jpg"
                )
                images_per_wiki += 1
                # Wikiページごとにダウンロードする画像の数を15に制限
                if images_per_wiki > MAX_IMAGES_PER_WIKI:
                    break
    except:
        print(str(Exception("No images found for Wikipedia page: ")) + title)
        continue

Wikipediaの画像のプロット

from PIL import Image
import matplotlib.pyplot as plt
import os

image_paths = []
for img_path in os.listdir("./mixed_wiki"):
    image_paths.append(str(os.path.join("./mixed_wiki", img_path)))


def plot_images(image_paths):
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(3, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            if images_shown >= 9:
                break


plot_images(image_paths)

Wikipediaのテキストと画像の両方のインデックスを作成するためにマルチモーダルインデックスとベクトルストアを作成

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import SimpleDirectoryReader, StorageContext

import qdrant_client
from llama_index.core import SimpleDirectoryReader


# ローカルにQdrantベクトルストアを作成
client = qdrant_client.QdrantClient(path="qdrant_img_db")

text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# マルチモーダルインデックスを作成
documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

入力クエリー画像のプロット

input_image = "./mixed_wiki/2.jpg"
plot_images([input_image])

画像クエリーによるマルチモーダルインデックスからの画像取得

1. 画像による画像取得の結果

# テキスト取得結果の生成
retriever_engine = index.as_retriever(image_similarity_top_k=4)
# GPT4Vのレスポンスから更に情報を取得
retrieval_results = retriever_engine.image_to_image_retrieve(
    "./mixed_wiki/2.jpg"
)
retrieved_images = []
for res in retrieval_results:
    retrieved_images.append(res.node.metadata["file_path"])

# 入力画像が最も高い類似スコアを返すので、入力画像である最初の取得画像を除外
plot_images(retrieved_images[1:])

それらしい結果が得られています。

2. 入力画像に基づいてGPT4Vが理由づけを行い取得した画像

GPT4Vの呼び出しでエラーになる場合には、OpenAIのクレジットを確認してください。最近、Pay as you goから事前のチャージが必要になりました。

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import ImageDocument

# ローカルディレクトリを指定します
image_documents = [ImageDocument(image_path=input_image)]

for res_img in retrieved_images[1:]:
    image_documents.append(ImageDocument(image_path=res_img))


openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1500
)
response = openai_mm_llm.complete(
    prompt="Given the first image as the base image, what the other images correspond to?",
    image_documents=image_documents,
)

print(response)

The images you've provided are all paintings that share a similar Post-Impressionist style, likely from the same time period. They all exhibit bold colors, dynamic brushstrokes, and a strong emotional component. The paintings seem to be by the same artist or by artists from the same movement. The style is reminiscent of Vincent van Gogh's work, known for his expressive and emotive use of color and form. Each painting depicts different subjects:

1. The first image shows a figure, possibly a farmer or a peasant, working near a tree with a large yellow sun in the background. The scene is likely set at dusk or dawn, given the prominence of the sun and the long shadows.

2. The second image depicts a man sitting at a table with a vase of sunflowers. The man appears to be in a contemplative or relaxed pose, and the painting captures a moment of still life combined with portraiture.

3. The third image is a landscape featuring a field with haystacks under a swirling sky with a crescent moon. The use of color and the swirling patterns in the sky are reminiscent of van Gogh's "Starry Night."

4. The fourth image is a vibrant scene of people working in a field with red and green foliage, and a large yellow sun in the sky. The painting captures the intensity of agricultural labor and the vivid colors of the landscape.

All four paintings are characteristic of the Post-Impressionist movement and share thematic elements of nature, human activity, and a sense of emotion conveyed through color and brushwork.

プロンプトを日本語にしてみます。

response = openai_mm_llm.complete(
    prompt="ベースイメージとして最初の画像が与えられた場合、他の画像はどのよう対応しますか？それぞれの画像を説明してください。",
    image_documents=image_documents,
)

これらの画像は、すべてビンセント・ヴァン・ゴッホの絵画のスタイルを持っているようです。彼の特徴的な筆のタッチと鮮やかな色使いが見て取れます。それぞれの画像について説明します。

最初の画像は、夕暮れ時の風景を描いています。太陽が大きく黄色く描かれており、木の枝が太陽に向かって伸びています。前景には、帽子をかぶった人物が木に寄りかかっているように見えます。この絵は、ヴァン・ゴッホの「夕暮れのサワー」という作品に似ています。

二番目の画像は、男性がテーブルに向かって座っている様子を描いています。テーブルの上には、花瓶に入ったひまわりがあり、男性はそのひまわりに手を伸ばしているように見えます。この絵は、ヴァン・ゴッホの「ひまわりを持つ画家の肖像」という作品を思わせます。

三番目の画像は、夜の風景を描いており、空には大きな赤い太陽があり、下には丘陵地帯が広がっています。この絵は、ヴァン・ゴッホの「夜のカフェテラス」や「星月夜」といった作品のスタイルを連想させますが、具体的な作品名を特定するのは難しいです。

四番目の画像は、農作業をしている人々の風景を描いています。夕日が空に浮かび、その光が地面や人々の服に反射しています。この絵は、ヴァン・ゴッホの「赤いぶどう畑」という作品に似ています。

これらの絵は、ヴァン・ゴッホの作品の特徴を捉えており、彼の表現主義的なスタイルと色彩の使用が際立っています。

すごい。

画像クエリーエンジンの活用

チュートリアル通りに動かしたら、こちらのエラーに遭遇したのでコードを修正しています。index.as_query_engineの引数をmulti_modal_llmではなくllmに。

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate


qa_tmpl_str = (
    "Given the images provided, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

qa_tmpl = PromptTemplate(qa_tmpl_str)


openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1500
)

query_engine = index.as_query_engine(
    llm=openai_mm_llm, image_qa_template=qa_tmpl
)

query_str = "Tell me more about the relationship between those paintings. "
response = query_engine.image_query("./mixed_wiki/2.jpg", query_str)
print(response)

The two paintings provided share a common artistic style that is characteristic of Post-Impressionism, a movement that emerged as a reaction against Impressionists' concern for the naturalistic depiction of light and color. Both paintings exhibit bold colors, dramatic, expressive brushwork, and a departure from accurate representation, which are hallmarks of Post-Impressionism.

The first painting features a night scene with a large yellow moon and a figure, which suggests an emotional and symbolic approach to the subject matter rather than a realistic portrayal. The second painting depicts a person with a vase of sunflowers, which also has a strong use of color and a focus on the emotional resonance of the subject.

While I cannot provide the names of the artists or the titles of the works due to the restrictions, it is evident that both paintings are by artists who were part of the Post-Impressionist movement, and they share similarities in technique and style that reflect the artists' interests in expressing their personal vision and emotional response to the world around them.

これも日本語でトライしてみます。

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate


qa_tmpl_str = (
    "画像が指定されたら、 "
    "クエリーに回答してください。\n"
    "クエリー: {query_str}\n"
    "回答: "
)

qa_tmpl = PromptTemplate(qa_tmpl_str)


openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_TOKEN, max_new_tokens=1500
)

query_engine = index.as_query_engine(
    llm=openai_mm_llm, image_qa_template=qa_tmpl
)

query_str = "これらの絵画の関係性について詳しく教えてください。 "
response = query_engine.image_query("./mixed_wiki/2.jpg", query_str)
print(response)

これらの絵画は、どちらもポスト印象派の画家、特にヴィンセント・ヴァン・ゴッホの作品と関連があります。最初の絵画はヴァン・ゴッホの「夜のカフェテラス」ではなく、太陽が描かれた風景で、彼の特徴的な筆使いと色彩が見て取れます。二番目の絵画は、ヴァン・ゴッホが描いた「ポール・ゴーギャンの肖像」です。これらの絵画は、ヴァン・ゴッホの強烈な色使いと感情的な表現が特徴で、彼の独自のスタイルを示しています。また、ヴァン・ゴッホがしばしば描いた自然のモチーフや、彼の周囲の人々の肖像が含まれていることから、彼の作品の中で重要なテーマを反映しています。

マルチモーダル面白い。

はじめてのDatabricks

Databricks無料トライアル

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up