Vision RAG with an Open Retriever (ColVision Series) and Generator (VLM)

Last updated: 2025-05-15. Posted: 2025-05-06.

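Introduction

Vision RAG※ is a RAG system that handles images containing text directly as images. Instead of first extracting text from an image and feeding that text into RAG, it encodes the image directly, then retrieves (images containing) text and generates text. Compared with conventional text-based RAG, it is attracting attention because it makes it easier to bring information from figures and tables into RAG, and because it preserves the relationship between the text and the figures in an image.

Recently, prompted by the release of Cohere Embed v4, a notebook was published that implements and tests Vision RAG using Cohere Embed v4 as the image Retriever and Gemini 2.5 Flash as the Generator.

Since I have lately been interested in open SLMs/SVLMs, I implemented and tested Vision RAG with an open Retriever and an open Generator, without using vendor APIs.
For the Retriever, I used vidore/colqwen2-v1.0 from the ColVision series.
For the Generator, I used Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct, 4-bit quantized by Unsloth. Qwen2.5-VL-72B-Instruct is also available, but it did not look like it would fit on the GPU in my environment, so I did not try it.
Incidentally, I chose Qwen-based models for both the Retriever and the Generator because Qwen models are known to score highly on the Open Japanese LLM Leaderboard (I have not managed to read it yet, but Nikkei published an article titled 「中国アリババQwen、日本製AIの土台に　性能ランクDeepSeek超え」). This time I test RAG on an English dataset rather than a Japanese one. Since there is probably more English than Japanese training data, I expect that whatever the models cannot do in English will also be hard for them in Japanese.

With that, let's make the most of the last day of Golden Week.

※ The term "Vision RAG" was used in the notebook, but I do not think it is in general use. A RAG similar to the one tried here appears in "VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents", so I suspect that is where the name Vision RAG came from (writing "VisRAG" would refer to that paper specifically).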

Environment

The Qwen2.5-VL-7B-Instruct experiments ran on a Google Colab L4 GPU. The Qwen2.5-VL-32B-Instruct experiments ran on a Google Colab A100 GPU.
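As a rough back-of-the-envelope check on model size versus GPU memory, the sketch below estimates the memory taken by the weights alone. The helper name `weight_memory_gb` and the nominal parameter counts (read off the model names) are assumptions for illustration. The estimate ignores activations, the KV cache, the vision encoder's working memory, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the model names (an assumption).
for name, params in [("7B", 7e9), ("32B", 32e9), ("72B", 72e9)]:
    print(
        f"{name}: ~{weight_memory_gb(params, 4):.1f} GB at 4-bit, "
        f"~{weight_memory_gb(params, 16):.1f} GB at 16-bit"
    )
```

Comparing these figures with the memory of the GPU you are allotted gives a first indication of which model sizes are worth trying.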

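Implementation

Loading the models

Here, the libraries are imported and the models loaded in this order: the VLM first, then the ColVision-series model (so that dependency resolution works out when running on Google Colab).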

%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit = True, # True loads the 4-bit quantized weights (as described in this article); False loads 16-bit.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
FastVisionModel.for_inference(model) # Enable for inference!

!pip install -q byaldi
!sudo apt-get install -y poppler-utils

from byaldi import RAGMultiModalModel
RAG = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v1.0", verbose=1)

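Collecting the RAG data and building the index

As in the published notebook, I prepare six images, each summarizing a company's income, expenses, and similar figures in a single picture. The images are then indexed so that they can be searched.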

import requests
import os
import io
import base64
import PIL.Image  # import the submodule explicitly so that PIL.Image.open is available
import tqdm
import time
import numpy as np

# Several images from https://www.appeconomyinsights.com/
images = {
    "tesla.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbef936e6-3efa-43b3-88d7-7ec620cdb33b_2744x1539.png",
    "netflix.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bd84c9-5b62-4526-b467-3088e27e4193_2744x1539.png",
    "nike.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5cd33ba-ae1a-42a8-a254-d85e690d9870_2741x1541.png",
    "google.png": "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F395dd3b9-b38e-4d1f-91bc-d37b642ee920_2741x1541.png",
    "accenture.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08b2227c-7dc8-49f7-b3c5-13cab5443ba6_2741x1541.png",
    "tecent.png": "https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ec8448c-c4d1-4aab-a8e9-2ddebe0c95fd_2741x1541.png"
}

# Download the images (the embeddings are computed later by RAG.index)
img_folder = "img"
os.makedirs(img_folder, exist_ok=True)

img_paths = []
for name, url in tqdm.tqdm(images.items()):
    img_path = os.path.join(img_folder, name)
    img_paths.append(img_path)

    # Download the image
    if not os.path.exists(img_path):
        response = requests.get(url)
        response.raise_for_status()

        with open(img_path, "wb") as fOut:
            fOut.write(response.content)

RAG.index(
    input_path="./img/",
    index_name="attention",
    store_collection_with_index=True,
    overwrite=True
)

Search and answer functions

def search(question):
    print("Question:", question)
    results = RAG.search(question, k=1)
    image_bytes = base64.b64decode(results[0].base64)
    # Convert bytes to image using Pillow
    image = PIL.Image.open(io.BytesIO(image_bytes))
    display(image)
    return image


# Answer the question based on the information from the image
# Here we use Qwen 2.5-VL-7B-Instruct as a small Vision-LLM
from transformers import TextStreamer
def answer(question, image):
    prompt = f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.

Question: {question}"""

    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": prompt}
        ]}
    ]
    input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
    inputs = tokenizer(
        image,
        input_text,
        add_special_tokens = False,
        return_tensors = "pt",
    ).to("cuda")

    print("LLM Answer:")
    text_streamer = TextStreamer(tokenizer, skip_prompt = True)
    _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
                      use_cache = True)
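The two functions above can be chained into a simple loop over the test questions. The following is a minimal sketch, not the code used in the article: `run_vision_rag`, `search_fn`, and `answer_fn` are hypothetical names, and the retriever and generator are passed in as callables so that the loop itself does not depend on any particular model.

```python
from typing import Any, Callable, List, Tuple


def run_vision_rag(
    questions: List[str],
    search_fn: Callable[[str], Any],
    answer_fn: Callable[[str, Any], Any],
) -> List[Tuple[str, Any]]:
    """Retrieve one image per question, then ask the generator about that image."""
    results = []
    for question in questions:
        image = search_fn(question)           # retrieval step
        output = answer_fn(question, image)   # generation step
        results.append((question, output))
    return results
```

In the notebook this would be called as `run_vision_rag(["What is the net profit for Nike?"], search, answer)`. Because `answer` as written above streams its output and returns nothing, the second element of each tuple would be `None` unless `answer` is changed to return the decoded text.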

Test results

As in the published notebook, I test whether the system can retrieve the image relevant to each question, read it, and answer.

Questions

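Q1: What is the net profit for Nike?
- Point: extract one piece of information from the Nike image (net profit is $0.8B)
- Expected image: nike.png
Q2: What are the 3 largest acquisitions from Google?
- Point: extract three pieces of information from the Google image (WIZ, MOTOROLA, MANDIANT)
- Expected image: google.png
Q3: What would be the net profit of Tesla without interest?
- Point: extract two numbers from the Tesla image and subtract one from the other (Net Profit 0.4 - Interest 0.3 = $0.1B)
- Expected image: tesla.png
Q4: Is GenAI a good business for consulting companies?
- Point: understand from the Accenture image that bookings for generative AI projects are trending upward, and state that generative AI is a good business for the consulting industry
- Expected image: accenture.png
Q5: In which region does Netflix generate the highest revenue?
- Point: extract one piece of information from the Netflix image (UCAN)
- Expected image: netflix.png
Q6: How much could tecent grow their revenue year-over-year for the last 5 years?
- Point: extract five numbers from the Tencent image and compute ratios from them (ideally the year-over-year growth rates for 2020 through 2024)
- Expected image: tecent.png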
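The answers in this article were judged by eye. For the pure extraction questions, a first-pass automatic check could look like the sketch below, which tests whether every expected keyword appears in the generated text. `EXPECTED_KEYWORDS` and `contains_all` are hypothetical names. The keywords are taken from the expected answers listed above, and plain substring matching is an assumption that can produce false positives (for example, "0.8" also matches "10.8"), so it does not replace reading the answers.

```python
from typing import Iterable

# Keywords taken from the expected answers for the extraction-only questions.
EXPECTED_KEYWORDS = {
    "Q1": ["0.8"],
    "Q2": ["Wiz", "Motorola", "Mandiant"],
    "Q5": ["UCAN"],
}


def contains_all(answer: str, keywords: Iterable[str]) -> bool:
    """Return True if every keyword occurs in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(keyword.lower() in lowered for keyword in keywords)
```

For example, `contains_all(generated_text, EXPECTED_KEYWORDS["Q2"])` is true only if all three acquisitions are named.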

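colqwen2-v1.0 retrieval results

With colqwen2-v1.0, the image needed to answer each question was selected correctly from the six images.
This is a very easy setting, so I would like to know how well the model performs in more realistic ones.
Examples would be a collection that contains similar images (such as several Google-related images), or images with text in languages other than English. I plan to look into benchmark results when I have time.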
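For a larger or harder image collection, the retrieval check could be expressed as a top-1 accuracy. The sketch below is one possible way to do that under stated assumptions: `top1_accuracy` and `retrieve_fn` are hypothetical names, and `retrieve_fn` is assumed to return the file name of the best-matching image for a question. How that file name is obtained from the retriever's result object is left out here.

```python
from typing import Callable, Dict


def top1_accuracy(expected: Dict[str, str], retrieve_fn: Callable[[str], str]) -> float:
    """Fraction of questions whose top-ranked image is the expected file."""
    if not expected:
        raise ValueError("expected must contain at least one question")
    hits = sum(1 for question, file_name in expected.items()
               if retrieve_fn(question) == file_name)
    return hits / len(expected)


# Question -> expected image file, following the six test questions in this article.
EXPECTED_IMAGES = {
    "What is the net profit for Nike?": "nike.png",
    "What are the 3 largest acquisitions from Google?": "google.png",
    "What would be the net profit of Tesla without interest?": "tesla.png",
    "Is GenAI a good business for consulting companies?": "accenture.png",
    "In which region does Netflix generate the highest revenue?": "netflix.png",
    "How much could tecent grow their revenue year-over-year for the last 5 years?": "tecent.png",
}
```

A value of 1.0 corresponds to the result reported above, where all six questions retrieved the expected image.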

Qwen2.5-VL-7B-Instruct generation results

Q1: correct

The net profit for Nike in Q3 FY25, as shown in the image, is $0.8 billion.<|im_end|>

Q2: correct (though the order of the answer is a concern)

The three largest acquisitions from Google, according to the image, are:

Motorola Mobility - Acquired in 2012 for $12.5 billion.

Wiz - Expected to be acquired in 2026 for $32.0 billion.

Mandiant - Acquired in 2014 for $5.4 billion.<|im_end|>

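Q3: incorrect (the model tried to derive the requested value step by step from the figure and ended up with a wrong number. Total Revenue and Cost of Revenue in the figure appear to be rounded, so computing all the way from Total Revenue down to net profit leads to a wrong final answer.)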

To determine the net profit of Tesla without interest, we need to follow these steps:

Identify the total revenue from the income statement.

Subtract the cost of revenue from the total revenue to get the gross profit.

Subtract operating expenses from the gross profit to get the operating profit.

Subtract interest from the operating profit to get the net profit.

From the image:

Total Revenue = $19.3B

Cost of Revenue = $16.2B

Operating Profit = $0.4B

Interest = $0.3B

Now, let's calculate the net profit without interest:

Gross Profit = Revenue - Cost of Revenue
Gross Profit = $19.3B - $16.2B
Gross Profit = $3.1B

Operating Profit = Gross Profit - Operating Expenses
Operating Profit = $3.1B - ($2.8B)
Operating Profit = $0.3B

Net Profit (with interest) = Operating Profit - Interest
Net Profit (with interest) = $0.3B - $0.3B
Net Profit (with interest) = $0.0B

Therefore, the net profit of Tesla without interest is $0.0B.<|im_end|>

Q4: correct (it states that "genAI is a good business for consulting companies")

The graph shows Accenture's generative AI bookings over several fiscal quarters, measured in billions of dollars. The data indicates a steady increase in bookings from Q2 FY23 to Q2 FY25. Here's a breakdown:

Q2 FY23: $0.1 billion

Q3 FY23: $0.1 billion

Q4 FY23: $0.3 billion

Q1 FY24: $0.5 billion

Q2 FY24: $0.6 billion

Q3 FY24: $0.9 billion

Q4 FY24: $1.0 billion

Q1 FY25: $1.2 billion

Q2 FY25: $1.4 billion

This trend suggests that generative AI is becoming a significant area of interest and investment for consulting companies like Accenture. The consistent growth in bookings implies that there is a growing demand for AI services, which could be a strong indicator that genAI is a good business for consulting companies. As the market for AI continues to evolve, it appears that consulting firms are positioning themselves to capitalize on this growth by offering specialized services related to AI implementation and strategy.<|im_end|>

Q5: correct

Netflix generates the highest revenue in the United States and Canada (UCAN) region, with $4.6 billion in revenue.<|im_end|>

Q6: incorrect (it did not interpret "the last 5 years" correctly and looked at the 2018-2023 values. The 2020-2024 trend is what is wanted.)

Based on the provided image, Tencent's revenue growth over the last five years can be calculated as follows:

2018 to 2019: Revenue increased from 316 billion RMB to 377 billion RMB, which is an increase of 61 billion RMB or approximately 19.4%.

2019 to 2020: Revenue increased from 377 billion RMB to 477 billion RMB, which is an increase of 100 billion RMB or approximately 26.4%.

2020 to 2021: Revenue increased from 477 billion RMB to 560 billion RMB, which is an increase of 83 billion RMB or approximately 17.4%.

2021 to 2022: Revenue decreased slightly from 560 billion RMB to 555 billion RMB, which is a decrease of 5 billion RMB or approximately -0.9%.

2022 to 2023: Revenue increased from 555 billion RMB to 609 billion RMB, which is an increase of 54 billion RMB or approximately 9.6%.

So, Tencent's revenue growth year-over-year for the last five years ranges from a decrease of 0.9% in 2021 to a maximum increase of 26.4% in 2019.<|im_end|>

Qwen2.5-VL-32B-Instruct generation results

Q1: correct

The net profit for Nike in Q3 FY25, as shown in the image, is $0.8B (or $800 million). This figure is indicated at the end of the flowchart, where it is explicitly labeled as "Net profit" with a value of $0.8B and a margin of 7%.<|im_end|>

Q2: correct

The three largest acquisitions by Google, based on the image provided, are:

Wiz - $32.0 billion (expected in 2026)

Motorola Mobility - $12.5 billion (acquired in 2012)

Mandiant - $5.4 billion (acquired in 2022)<|im_end|>

Q3: incorrect (it assumed that Tesla is the one paying the interest)

To determine the net profit of Tesla without considering interest, we need to follow these steps:

Identify the Net Profit: From the image, the net profit is explicitly stated as $0.4B.

Identify the Interest Expense: The image shows that the interest expense is $0.3B.

Calculate Net Profit Without Interest: To find the net profit without interest, we add the interest expense back to the net profit. This is because the interest expense reduces the net profit, so removing it will give us the profit before interest.

\[
\text{Net Profit Without Interest} = \text{Net Profit} + \text{Interest Expense}
\]

Substituting the values from the image:

\[
\text{Net Profit Without Interest} = \$0.4\text{B} + \$0.3\text{B} = \$0.7\text{B}
\]

Final Answer:

The net profit of Tesla without interest would be $0.7B.<|im_end|>

Q4: correct (it states that "GenAI appears to be a very good business")

The image shows the growth of Accenture's Generative AI (GenAI) bookings over several fiscal quarters, measured in billions of dollars. Here is an analysis based on the data presented:

Key Observations:

Initial Growth (FY23):

In Q2 FY23, GenAI bookings were $0.1 billion.

By Q4 FY23, bookings increased to $0.3 billion.

In Q1 FY24, bookings further rose to $0.5 billion.

Accelerated Growth (FY24):

In Q2 FY24, bookings reached $0.6 billion.

By Q3 FY24, bookings increased to $0.9 billion.

In Q4 FY24, bookings jumped to $1.0 billion.

Continued Momentum (FY25):

In Q1 FY25, bookings were $1.2 billion.

By Q2 FY25, bookings peaked at $1.4 billion.

Trends and Implications:

Consistent Growth: The data clearly shows a consistent upward trend in GenAI bookings over time. This indicates strong demand and increasing adoption of GenAI solutions by clients.

Significant Increase: The bookings have grown from $0.1 billion in Q2 FY23 to $1.4 billion in Q2 FY25, representing a 14-fold increase over two years.

Scalability and Demand: The rapid growth suggests that GenAI is not only a viable but also a highly scalable business opportunity for consulting companies like Accenture.

Conclusion:

Based on the data provided, GenAI appears to be a very good business for consulting companies. The consistent and significant growth in bookings over time indicates high demand and successful implementation of GenAI solutions. Consulting firms can leverage this trend to drive revenue, innovate services, and stay competitive in the rapidly evolving technology landscape. Therefore, GenAI is likely to remain a lucrative area for consulting companies moving forward.<|im_end|>

Q5: correct

To determine in which region Netflix generates the highest revenue, let's analyze the information provided in the image:

UCAN (United States and Canada):

Revenue: $4.6B

Growth: +9% Y/Y

EMEA (Europe, Middle East, and Africa):

Revenue: $3.4B

Growth: +15% Y/Y

LATAM (Latin America):

Revenue: $1.3B

Growth: +8% Y/Y

APAC (Asia-Pacific):

Revenue: $1.3B

Growth: +23% Y/Y

Step-by-Step Analysis:

UCAN: Generates $4.6B in revenue, which is the highest among all regions.

EMEA: Generates $3.4B in revenue, which is lower than UCAN but higher than LATAM and APAC.

LATAM: Generates $1.3B in revenue, which is the same as APAC.

APAC: Also generates $1.3B in revenue, matching LATAM.

Conclusion:

The region where Netflix generates the highest revenue is UCAN (United States and Canada) with $4.6B in revenue.

Final Answer: UCAN<|im_end|>

Q6: incorrect (it did not interpret "the last 5 years" correctly and looked at the 2019-2024 values. The 2020-2024 trend is what is wanted.)

This output was generated with max_new_tokens = 2048.

To determine how much Tencent's revenue could have grown year-over-year over the last five years, we need to analyze the revenue data provided in the image for each year from 2019 to 2024. Here's the step-by-step analysis:

Step 1: Extract Revenue Data

The image shows Tencent's revenue in billions of RMB for each year from 2018 to 2024. The values are as follows:

2018: 316 billion RMB

2019: 377 billion RMB

2020: 477 billion RMB

2021: 560 billion RMB

2022: 555 billion RMB

2023: 609 billion RMB

2024: 660 billion RMB

Step 2: Calculate Year-Over-Year Growth

We will calculate the percentage growth for each year compared to the previous year.

2019 vs. 2018

\[ \text{Growth} = \left( \frac{\text{Revenue in 2019} - \text{Revenue in 2018}}{\text{Revenue in 2018}} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{377 - 316}{316} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{61}{316} \right) \times 100 \approx 19.3\% \]

2020 vs. 2019

\[ \text{Growth} = \left( \frac{\text{Revenue in 2020} - \text{Revenue in 2019}}{\text{Revenue in 2019}} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{477 - 377}{377} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{100}{377} \right) \times 100 \approx 26.5\% \]

2021 vs. 2020

\[ \text{Growth} = \left( \frac{\text{Revenue in 2021} - \text{Revenue in 2020}}{\text{Revenue in 2020}} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{560 - 477}{477} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{83}{477} \right) \times 100 \approx 17.4\% \]

2022 vs. 2021

\[ \text{Growth} = \left( \frac{\text{Revenue in 2022} - \text{Revenue in 2021}}{\text{Revenue in 2021}} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{555 - 560}{560} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{-5}{560} \right) \times 100 \approx -0.9\% \]

2023 vs. 2022

\[ \text{Growth} = \left( \frac{\text{Revenue in 2023} - \text{Revenue in 2022}}{\text{Revenue in 2022}} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{609 - 555}{555} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{54}{555} \right) \times 100 \approx 9.7\% \]

2024 vs. 2023

\[ \text{Growth} = \left( \frac{\text{Revenue in 2024} - \text{Revenue in 2023}}{\text{Revenue in 2023}} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{660 - 609}{609} \right) \times 100 \]
\[ \text{Growth} = \left( \frac{51}{609} \right) \times 100 \approx 8.4\% \]

Step 3: Summarize the Results

Here are the year-over-year growth rates for the last five years:

2019: +19.3%

2020: +26.5%

2021: +17.4%

2022: -0.9%

2023: +9.7%

2024: +8.4%

Final Answer

Tencent's revenue growth year-over-year for the last five years (2019–2024) was as follows:

2019: +19.3%

2020: +26.5%

2021: +17.4%

2022: -0.9%

2023: +9.7%

2024: +8.4%

Boxed Answer:
\[
\boxed{+19.3\%, +26.5\%, +17.4\%, -0.9\%, +9.7\%, +8.4\%}
\]<|im_end|>

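Summary of Generator results

With both Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct, every question except Q3 and Q6 was answered correctly. Q3 and Q6 are the questions that require computing the answer from information in the image. Given this, it seems that Qwen2.5-VL-7B/32B-Instruct had difficulty going from information extraction through calculation in a single turn. Extraction itself showed no problems, so I think only the calculation (reasoning) step fell one step short.
Also, in the Q3 and Q6 answers the models tried to work step by step even though they were not instructed to, which surprised me a little. Presumably CoT data was included in the SFT dataset.
To solve Q3 and Q6 while still using an open VLM, possible approaches include splitting extraction and reasoning into separate processes, splitting them and additionally having calculations like these done through tool calls, or brute-forcing it with a smarter VLM (one that supports a reasoning mode). The Qwen3 series, which supports a reasoning mode, was announced recently, so I am looking forward to Qwen3-VL.
Incidentally, the closed models (Cohere Embed v4 🤝 Gemini 2.5 Flash) appear to have cleared every question. Impressive.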
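To make the tool-call idea concrete, here is a minimal sketch of the kind of calculation tools that Q6 and Q3 would need once the numbers have been extracted. The function names `yoy_growth` and `profit_excluding_interest` are hypothetical. The Tencent revenue figures are the values the 32B model read from the image in the output quoted earlier (billions of RMB), not independently verified numbers, and the Q3 inputs are the ones given in the expected answer.

```python
from decimal import Decimal
from typing import Dict, Iterable


def yoy_growth(revenue_by_year: Dict[int, float], years: Iterable[int]) -> Dict[int, float]:
    """Year-over-year growth in percent for each year, relative to the year before."""
    growth = {}
    for year in years:
        previous, current = revenue_by_year[year - 1], revenue_by_year[year]
        growth[year] = round((current - previous) / previous * 100, 1)
    return growth


def profit_excluding_interest(net_profit: str, interest: str) -> Decimal:
    """Net profit minus interest, using Decimal to avoid binary floating-point noise."""
    return Decimal(net_profit) - Decimal(interest)


# Revenue as read from the image by the model (billions of RMB).
tencent_revenue = {2019: 377, 2020: 477, 2021: 560, 2022: 555, 2023: 609, 2024: 660}

print(yoy_growth(tencent_revenue, range(2020, 2025)))
# {2020: 26.5, 2021: 17.4, 2022: -0.9, 2023: 9.7, 2024: 8.4}
print(profit_excluding_interest("0.4", "0.3"))
# 0.1
```

With tools like these, the VLM only has to extract the right numbers and pick the right years, while the arithmetic is done deterministically.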

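This time I tested extraction and reasoning on single figures, but in practice I think the more common case is a page that is mostly text with figures and tables interspersed. I am curious whether the needed information can still be read and answered in that case, and also whether handling several images at once works well. I would like to look into benchmark results for this as well when I have time.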

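Conclusion

In this article, I implemented Vision RAG by combining an open ColVision-series Retriever with Qwen2.5-VL-family Generators and tested it on Google Colab. In the tests, colqwen2-v1.0 retrieved the images needed to answer the questions. Both Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct answered the information-extraction tasks (Q1, Q2, Q4, Q5) correctly and the tasks involving numerical calculation (Q3, Q6) incorrectly.
Although they fall short of the strong closed models, this confirmed that open models can, to a certain degree, serve as components of Vision RAG. Whether open models also hold up in more practical settings has not been verified, so I would like to investigate and experiment further.
This was a hurried write-up on the last day of Golden Week, but I hope it is useful.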

References

I referred to the following. Thank you.

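Additional experiment

Checking Vision RAG on Japanese images

I try Vision RAG on two so-called "Kasumigaseki mandala" slides: the Ministry of the Environment's slide on the Regional Circular and Ecological Sphere (地域循環共生圏) and the Ministry of Health, Labour and Welfare's example of the future vision of the medical and long-term care delivery system (医療・介護の提供体制の将来像の例).
Two cases are tested.

Closed case: Cohere Embed v4 as the image Retriever and Gemini 2.5 Flash as the Generator
Open case: vidore/colqwen2-v1.0 from the ColVision series as the Retriever and Unsloth's 4-bit quantized Qwen2.5-VL-7B-Instruct as the Generator

The prompt to the Generator is unchanged.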

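Questions

Q1: 地域循環共生圏ではどのような交通システムが求められていますか? (What kind of transportation system is called for in the Regional Circular and Ecological Sphere?)
- Point: extract one to three pieces of information from the 地域循環共生圏 image (at minimum, the answer should mention that a people-friendly and attractive 「交通・移動」 (transport and mobility) system is called for)
- Expected image: the 地域循環共生圏 slide (Ministry of the Environment)
Q2: 地域包括ケアシステムとはどのようなものですか? (What is the Community-based Integrated Care System?)
- Point: extract one piece of information from the 医療・介護の提供体制の将来像の例 image (a system in which medical care, long-term care, prevention, and housing are provided seamlessly, continuously, and in an integrated manner within the daily living area)
- Expected image: the 医療・介護の提供体制の将来像の例 slide (Ministry of Health, Labour and Welfare)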

1. With closed models

Retriever

Partly because only two images were prepared, the one image needed for each question was picked correctly.

Generator

Q1: correct (a rich answer that includes related information from the figure)

画像によると、地域循環共生圏では、人々に優しく魅力ある「交通・移動」システムが求められています。これは、安心と利便性で高齢者や子育て世代に優しい移動手段であり、また地域の魅力を引き出す交通システムであると説明されています。

Q2: incorrect (it extends the scope of the Community-based Integrated Care System beyond the daily living area)

地域包括ケアシステムとは、日常生活圏域内において、医療、介護、予防、住まいが切れ目なく、継続的かつ一体的に提供されるシステムです。このシステムは、機能分化を進めながら、地域住民を重層的に支える医療・介護サービスのネットワークを構築することを目的としています。具体的には、小・中学校区レベルでの日常的な医療・介護サービスの提供体制や、人口20万～30万レベルでの基幹病院を中心としたネットワーク、都道府県レベルでの高度な医療への対応体制などを整備し、多様なニーズに対応できる包括的なケアの実現を目指しています。

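2. With open models

Retriever

Partly because only two images were prepared, the one image needed for each question was picked correctly.

Generator

Q1: correct (a rich answer that includes related information from the figure. Compared with Gemini, the explanation ties in more of the model's own knowledge; for example, eco-mobility presumably subsumes green slow mobility.)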

地域循環共生圏では、人にとって優しい魅力のある「交通・移動システム」が求められています。このシステムは、高齢者や子育て世代に優しく、安心と利便性を提供することが重要です。また、地域の魅力を引き出す交通システムとして機能することも期待されています。具体的には、公共交通機関や自転車シェアリングサービス、エコモビリティサービスなどが考えられます。これらの交通手段は、環境に優しく、持続可能な社会を実現するための重要な要素となります。<|im_end|>

Q2: incorrect (it extends the scope of the Community-based Integrated Care System beyond the daily living area)

地域包括ケアシステムは、日常生活圏域内において、医療、介護、予防、住まいが切れ目なく、継続的かつ一体的に提供されるシステムです。このシステムの特徴は以下の通りです：

日常生活圏域内でのサービス：小・中学校区レベル（人口1万人程度の圈域）において日常的な医療・介護サービスが提供されます。

地域の基幹病院機能：人口20～30万人レベルで地域の基幹病院機能を整備します。

高度医療への体制：都道府県レベルで救命救急・がんなどの高度医療への体制を整備します。

医療提供体制の充実と重点化・効率化：都道府県域から市町村域まで、重層的に医療サービスを提供します。

地域包括ケアの実現：包括的ケアマネジメントの機能強化を目指します。

連携強化：地域に密着した病院や在宅医療連携拠点機能を強化し、救急患者の確実な受入れや早期の地域復帰を図ります。

地域連携の強化：地域の多様なニーズに対応し、地域の医療機関のネットワークを構築します。

新しい公共パートナーシップ：認知症サポーター、生活・介護支援サポーター、NPO、住民参加など、地域の支え合いを強化します。

医療人材と介護人材の育成：地域包括ケアシステムの実現に必要な人材育成を推進します。

このシステムは、地域の医療・介護サービスを一貫して提供し、住民の生活を支えることを目的としています。<|im_end|>

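Summary of the quick Vision RAG test on Japanese images

In the retrieval test, perhaps because the task was as easy as picking the one image out of two that contains the query keyword, both Cohere Embed v4 and colqwen2-v1.0 from the ColVision series succeeded. I am curious what happens with more images or with similar images added, but this confirmed that, to a certain degree, images can be retrieved with an open model in Japanese as well.
In the Generator test, both the closed model (Gemini 2.5 Flash) and the open model (Qwen2.5-VL-7B-Instruct) accurately extracted 「人々に優しく魅力ある交通・移動システム」 (a people-friendly and attractive transport and mobility system) from the slide for Q1 (transportation system). For Q2 (Community-based Integrated Care System), however, both gave a wrong answer that contradicts the fact that the system operates within the daily living area. I think this error can be eliminated with text-generation techniques (prompting for CoT, separating extraction from reasoning, having the model review its own output and answer again, and so on). This is a test of only two questions, but the open model produced output on par with the closed model, and the error found here looks addressable for this open model by improving the generation pipeline. For both reasons, Vision RAG on Japanese images with open models also looks promising.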
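One of the techniques mentioned above, separating extraction from reasoning, could be wired up roughly as follows. This is a sketch under stated assumptions rather than something tested in this article: `two_stage_answer` and `generate_fn` are hypothetical names, `generate_fn(prompt, image)` is assumed to return the generated text as a string (the `answer` function shown earlier streams its output and would need to be adapted), and the prompt wording is illustrative only.

```python
from typing import Any, Callable, Optional


def two_stage_answer(
    question: str,
    image: Any,
    generate_fn: Callable[[str, Optional[Any]], str],
) -> str:
    """Stage 1 extracts facts from the image; stage 2 answers from those facts only."""
    extract_prompt = (
        "List every fact in the image that is relevant to the question below. "
        "Quote the image text exactly and do not answer the question yet.\n\n"
        f"Question: {question}"
    )
    facts = generate_fn(extract_prompt, image)

    reason_prompt = (
        "Answer the question using only the facts below. "
        "Do not include facts that fall outside the scope of the question.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {question}"
    )
    # The second call is text-only: the image is deliberately not passed again.
    return generate_fn(reason_prompt, None)
```

The second stage sees only the extracted facts, which gives a natural place to check them, or to add a review step, before the final answer is produced.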

Topics I may add later

About ColVision/ColPali
About VisRAG
About byaldi
About Unsloth's 4-bit quantization
