
Using two or three image files as input to GPT-4o (images Base64-encoded) with OpenAI's Node.js library

Last updated at 2024-05-19, posted at 2024-05-19

This is a continuation of the GPT-4o image input experiment from the following article.

●Using a Base64-encoded image file as input to GPT-4o: combining Node.js process.loadEnvFile() with the OpenAI library - Qiita
　https://qiita.com/youtoy/items/3844c6904b6a39fdad64

This time, I try out "Multiple image inputs", which is described in the following part of the official documentation.
(I happened to come across it and felt like trying it.)

●Vision - OpenAI API
　https://platform.openai.com/docs/guides/vision/multiple-image-inputs

The official sample above takes two input images, both specified by URL.

In this article, I change those two points as follows.

Try both two and three input images
Use local files as the input images and Base64-encode them in Node.js
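The Base64 step can be sketched as a small helper that turns raw image bytes into the data URL format used later in this article. This is only a minimal sketch: the helper name `toDataUrl` and the extension-to-MIME table are assumptions for illustration and do not appear in the original programs, which hard-code `image/jpeg`.

```javascript
// Hypothetical helper (not from the original programs): build a data URL from raw image bytes.
// The extension-to-MIME table is an illustrative assumption.
const MIME_TYPES = new Map([
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["png", "image/png"],
  ["webp", "image/webp"],
  ["gif", "image/gif"],
]);

function toDataUrl(imageBuffer, fileName) {
  // Take the text after the last "." as the extension
  const extension = fileName.toLowerCase().split(".").pop();
  const mimeType = MIME_TYPES.get(extension);
  if (!mimeType) {
    throw new Error(`Unsupported image extension: ${fileName}`);
  }
  return `data:${mimeType};base64,${imageBuffer.toString("base64")}`;
}

// In-memory bytes are used here; with a local file the Buffer would come from fs.readFileSync(path).
const sampleBytes = Buffer.from([0xff, 0xd8, 0xff]); // the first three bytes of a JPEG file
const sampleDataUrl = toDataUrl(sampleBytes, "./input1.jpg");
console.log(sampleDataUrl); // data:image/jpeg;base64,/9j/
```

Taking a Buffer rather than a path keeps the helper free of file access, so it can be checked without any image files at hand.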

As for the implementation, as in the previous article, I use OpenAI's Node.js library together with Node.js's process.loadEnvFile().

Trying it out

Let's actually try it.

Input images

First, the input images used this time.

I use three images chosen from photos of works I made in the past.

Implementation

The program for this article is based on the one used in the previous article.

Environment variables are again loaded with Node.js's process.loadEnvFile() (※ for how to use it, please refer to the previous article).
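As a rough sketch of that loading step, the snippet below wraps process.loadEnvFile() so that a missing file or an older Node.js version does not stop the script. The wrapper name `loadEnvIfPresent` is hypothetical, and it is an assumption here that the env file contains an `OPENAI_API_KEY` entry, which is the variable the OpenAI library reads by default.

```javascript
// Sketch: load an env file if possible and report whether the API key is available.
// "./development.env" follows the article; the OPENAI_API_KEY entry inside it is an assumption.
function loadEnvIfPresent(envPath) {
  if (typeof process.loadEnvFile !== "function") {
    // Older Node.js versions do not provide process.loadEnvFile()
    return false;
  }
  try {
    process.loadEnvFile(envPath);
    return true;
  } catch (error) {
    // process.loadEnvFile() throws when the file cannot be read
    console.warn(`Could not load ${envPath}: ${error.message}`);
    return false;
  }
}

const envLoaded = loadEnvIfPresent("./development.env");
const hasApiKey =
  typeof process.env.OPENAI_API_KEY === "string" &&
  process.env.OPENAI_API_KEY.length > 0;
console.log({ envLoaded, hasApiKey });
```

Checking `hasApiKey` before creating the client makes a missing key visible early, instead of surfacing as an error from the library.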

Below are the examples for two input images and for three input images, in that order.

Two input images

The following is the code for two input images.

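import OpenAI from "openai";
import fs from "fs";

// Load environment variables from the env file before creating the client
process.loadEnvFile("./development.env");

// With no arguments, the client reads the API key from OPENAI_API_KEY
const openai = new OpenAI();

async function main() {
  // Prompt (in Japanese): "What is shown in these images?"
  const message = "これらの画像には何がうつってる？";
  console.log(message);

  // Read each local image file and Base64-encode its contents
  const imagePaths = ["./input1.jpg", "./input2.jpg"];
  const imageBase64List = imagePaths.map((imagePath) => {
    const imageBuffer = fs.readFileSync(imagePath);
    return imageBuffer.toString("base64");
  });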

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: message },
          {
            type: "image_url",
            image_url: {
              url: `data:image/jpeg;base64,${imageBase64List[0]}`,
            },
          },
          {
            type: "image_url",
            image_url: {
              url: `data:image/jpeg;base64,${imageBase64List[1]}`,
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
}
main();

I ran the above and checked the response.

The prompt was a simple one, "これらの画像には何がうつってる？" ("What is shown in these images?"), and the response gives a rough description of each of the two images.
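The program prints the whole first choice object. If only the text of the answer is wanted, it can be read from the message inside that choice. The sketch below uses a hypothetical stand-in object shaped like a Chat Completions response; its text is a placeholder and not actual model output, and the function name `getFirstMessageText` is an assumption for illustration.

```javascript
// Hypothetical stand-in for the API response; the text is a placeholder, not real model output.
const mockResponse = {
  choices: [
    {
      index: 0,
      message: {
        role: "assistant",
        content: "(description of image 1 and image 2)",
      },
      finish_reason: "stop",
    },
  ],
};

// Return the assistant's text from the first choice, or null if there is none.
function getFirstMessageText(response) {
  const firstChoice = response.choices?.[0];
  return firstChoice?.message?.content ?? null;
}

console.log(getFirstMessageText(mockResponse));
```

With the real response, the same function could be called as `getFirstMessageText(response)` in place of printing `response.choices[0]`.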

Three input images

The following is the code for three input images.

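import OpenAI from "openai";
import fs from "fs";

// Load environment variables from the env file before creating the client
process.loadEnvFile("./development.env");

// With no arguments, the client reads the API key from OPENAI_API_KEY
const openai = new OpenAI();

async function main() {
  // Prompt (in Japanese): "What is shown in these images?"
  const message = "これらの画像には何がうつってる？";
  console.log(message);

  // Read each local image file and Base64-encode its contents
  const imagePaths = ["./input1.jpg", "./input2.jpg", "./input3.jpg"];
  const imageBase64List = imagePaths.map((imagePath) => {
    const imageBuffer = fs.readFileSync(imagePath);
    return imageBuffer.toString("base64");
  });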

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: message },
          {
            type: "image_url",
            image_url: {
              url: `data:image/jpeg;base64,${imageBase64List[0]}`,
            },
          },
          {
            type: "image_url",
            image_url: {
              url: `data:image/jpeg;base64,${imageBase64List[1]}`,
            },
          },
          {
            type: "image_url",
            image_url: {
              url: `data:image/jpeg;base64,${imageBase64List[2]}`,
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
}
main();

I ran the above and checked the response.

The prompt is the same as in the two-image case.

Again, the response gives a rough description of each input image.

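In the part of the response describing the third image, the company name printed on the devices was also mentioned.
※ The same company name appears on two of the three devices, and the response seemed to recognize that the same name is printed on more than one device.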
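The two programs in this article differ only in the number of image_url entries, so the content array could also be generated from the list of Base64 strings. The following is a minimal sketch of that idea: the function name `buildUserContent` and the placeholder strings are hypothetical, and it assumes every image is a JPEG, as in the programs above.

```javascript
// Hypothetical helper: one text part followed by one image_url part per image.
// Assumes all images share one MIME type (JPEG by default, as in the article).
function buildUserContent(promptText, imageBase64List, mimeType = "image/jpeg") {
  const imageParts = imageBase64List.map((imageBase64) => ({
    type: "image_url",
    image_url: { url: `data:${mimeType};base64,${imageBase64}` },
  }));
  return [{ type: "text", text: promptText }, ...imageParts];
}

// Placeholder strings stand in for real Base64-encoded image data.
const placeholderImages = ["AAAA", "BBBB", "CCCC"];
const content = buildUserContent("What is shown in these images?", placeholderImages);

console.log(content.length); // 4
console.log(content.map((part) => part.type).join(",")); // text,image_url,image_url,image_url
```

The returned array could then be passed as `content` in the user message, with `imageBase64List` from the programs above as the second argument.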
