How the truncate setting changes Elasticsearch text_embedding behavior when max_sequence_length is exceeded

Posted at 2025-01-24

Introduction

To run vector search, I generate embeddings with the .multilingual-e5-small_linux-x86_64 model. As the Hugging Face page for multilingual-e5-small states, this model accepts at most 512 tokens.

Limitations
Long texts will be truncated to at most 512 tokens.

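I therefore decided to chunk the text in an ingest pipeline, following a reference article.
However, that article assumes English data, and I was not fully confident that the same setting (the referenced article uses "model_limit": 400) would keep Japanese data within the maximum length.

It turns out there is a way to check whether Japanese data stays within the maximum length (that is, whether truncation occurs), so I describe it here.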

The truncate setting

In Elasticsearch, the truncate behavior can be configured through the options of the inference processor. The text_embedding config used here has a tokenization setting. By default, tokenization is bert and truncate is first. When truncate is first, input that exceeds max_sequence_length is truncated silently, with no error or other signal.

Because I am using .multilingual-e5-small_linux-x86_64, the tokenization is xlm_roberta. The default truncate value for xlm_roberta is also first.

Setting truncate to none makes the request fail with an error when max_sequence_length is exceeded.

Properties of roberta
truncate
(Optional, string) Indicates how tokens are truncated when they exceed max_sequence_length. The default value is first.

none: No truncation occurs; the inference request receives an error.

first: Only the first sequence is truncated.

second: Only the second sequence is truncated. If there is just one sequence, that sequence is truncated.
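For the single-sequence case used in this article, the three documented modes can be illustrated with a small Python sketch. This is not Elasticsearch's implementation: `apply_truncate` is a hypothetical helper, tokens are treated as plain list items, and the special tokens a real tokenizer adds are ignored.

```python
def apply_truncate(tokens, max_sequence_length, truncate="first"):
    """Illustrate the documented truncate modes for one input sequence.

    Hypothetical helper for illustration only; special tokens are ignored.
    """
    if truncate not in ("none", "first", "second"):
        raise ValueError(f"unknown truncate mode: {truncate}")
    if len(tokens) <= max_sequence_length:
        return list(tokens)  # within the limit: returned unchanged
    if truncate == "none":
        # No truncation: the request is rejected instead.
        raise ValueError(
            f"input of {len(tokens)} tokens exceeds the limit "
            f"of {max_sequence_length}"
        )
    # With a single sequence, "first" and "second" both cut that sequence.
    return list(tokens[:max_sequence_length])
```

With `first`, a 532-token input silently becomes 512 tokens; with `none`, the same input raises an error, which is the behavior this article relies on to detect oversized chunks.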

Behavior with a default-setting pipeline

I verify this with the same pipeline and index as the referenced article.

Creating the pipeline

PUT _ingest/pipeline/japanese-text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": ".multilingual-e5-small_linux-x86_64",
        "target_field": "text_embedding",
        "field_map": {
          "title": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}

Creating the index

PUT japanese-text-with-embeddings
{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

Next, index a document that exceeds max_sequence_length.

POST japanese-text-with-embeddings/_doc?pipeline=japanese-text-embeddings
{
  "title": "　親譲《おやゆず》りの無鉄砲《むてっぽう》で小供の時から損ばかりしている。小学校に居る時分学校の二階から飛び降りて一週間ほど腰《こし》を抜《ぬ》かした事がある。なぜそんな無闇《むやみ》をしたと聞く人があるかも知れぬ。別段深い理由でもない。新築の二階から首を出していたら、同級生の一人が冗談《じょうだん》に、いくら威張《いば》っても、そこから飛び降りる事は出来まい。弱虫やーい。と囃《はや》したからである。小使《こづかい》に負ぶさって帰って来た時、おやじが大きな眼《め》をして二階ぐらいから飛び降りて腰を抜かす奴《やつ》があるかと云《い》ったから、この次は抜かさずに飛んで見せますと答えた。　親類のものから西洋製のナイフを貰《もら》って奇麗《きれい》な刃《は》を日に翳《かざ》して、友達《ともだち》に見せていたら、一人が光る事は光るが切れそうもないと云った。切れぬ事があるか、何でも切ってみせると受け合った。そんなら君の指を切ってみろと注文したから、何だ指ぐらいこの通りだと右の手の親指の甲《こう》をはすに切り込《こ》んだ。幸《さいわい》ナイフが小さいのと、親指の骨が堅《かた》かったので、今だに親指は手に付いている。しかし創痕《きずあと》は死ぬまで消えぬ。　庭を東へ二十歩に行き尽《つく》すと、南上がりにいささかばかりの菜園があって、真中《まんなか》に栗《くり》の木が一本立っている。これは命より大事な栗だ。実の熟する時分は起き抜けに背戸《せど》を出て落ちた奴を拾ってきて、学校で食う。菜園の西側が山城屋《やましろや》という質屋の庭続きで、この質屋に勘太郎《かんたろう》という十三四の倅《せがれ》が居た。勘太郎は無論弱虫である。"
}

(Source: Natsume Soseki, "Botchan")

The document is indexed into japanese-text-with-embeddings successfully.
Given the default setting, the part beyond max_sequence_length was presumably truncated.

Behavior with a truncate none pipeline

Create a pipeline that sets truncate to none.

Creating the pipeline

PUT _ingest/pipeline/japanese-text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": ".multilingual-e5-small_linux-x86_64",
        "target_field": "text_embedding",
        "field_map": {
          "title": "text_field"
        },
        "inference_config": {
          "text_embedding": {
            "tokenization": {
              "xlm_roberta": {
                "truncate": "none"
              }
            }
          }
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}

When you index a document that does not exceed max_sequence_length, it is written to the specified index, japanese-text-with-embeddings.

When you index a document that exceeds max_sequence_length, an error occurs, the on_failure handling runs, and the document is written to failed-japanese-text-with-embeddings instead.

The document in failed-japanese-text-with-embeddings records the reason for the failure, including the actual token count:

"failure": "Input too large. The tokenized input length [532] exceeds the maximum sequence length [512]"
}
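To see how far each failed document overshoots the limit, the token counts can be pulled out of the failure message. The following is a minimal Python sketch; `parse_token_overflow` is a hypothetical helper, and the message format is assumed from the single example above, so it may differ between Elasticsearch versions.

```python
import re

# Matches the message format shown above (assumed from that one example).
_OVERFLOW = re.compile(
    r"tokenized input length \[(\d+)\] "
    r"exceeds the maximum sequence length \[(\d+)\]"
)


def parse_token_overflow(message):
    """Return (actual_tokens, max_tokens, excess), or None if no match."""
    match = _OVERFLOW.search(message)
    if match is None:
        return None
    actual, limit = int(match.group(1)), int(match.group(2))
    return actual, limit, actual - limit


message = (
    "Input too large. The tokenized input length [532] "
    "exceeds the maximum sequence length [512]"
)
print(parse_token_overflow(message))  # (532, 512, 20)
```

Collecting the excess values across the failed documents gives a rough idea of how much the chunk size needs to shrink for Japanese text.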

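Without the on_failure processors

If you remove the on_failure setting from the pipeline, the error is returned directly when you index the document, as shown below. (The request simply fails, and nothing is written to any index.)

Creating the pipeline without on_failure processors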

PUT _ingest/pipeline/japanese-text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": ".multilingual-e5-small_linux-x86_64",
        "target_field": "text_embedding",
        "field_map": {
          "title": "text_field"
        },
        "inference_config": {
          "text_embedding": {
            "tokenization": {
              "xlm_roberta": {
                "truncate": "none"
              }
            }
          }
        }
      }
    }
  ]
}

Result of indexing the document

{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "Input too large. The tokenized input length [532] exceeds the maximum sequence length [512]"
      }
    ],
    "type": "status_exception",
    "reason": "Input too large. The tokenized input length [532] exceeds the maximum sequence length [512]"
  },
  "status": 400
}
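A client that indexes documents without on_failure processors has to handle this response itself. The sketch below assumes the client has the raw response body as a string; `failure_reasons` is a hypothetical helper that only reads the fields visible in the response above.

```python
import json


def failure_reasons(response_body):
    """Collect root-cause reasons from an error response like the one above.

    Returns an empty list when the body has no "error" object.
    """
    body = json.loads(response_body)
    error = body.get("error")
    if not isinstance(error, dict):
        return []
    return [cause.get("reason", "") for cause in error.get("root_cause", [])]


# Trimmed copy of the response shown above.
example = """
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "Input too large. The tokenized input length [532] exceeds the maximum sequence length [512]"
      }
    ],
    "type": "status_exception"
  },
  "status": 400
}
"""
for reason in failure_reasons(example):
    print(reason)
```

Keeping the on_failure processors is usually more convenient for bulk loads, since the failed documents and their reasons end up in a separate index instead of aborting each request.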
