
Running multilingual-e5-large on OpenSearch 2.13

OpenSearch

Posted at 2024-04-25

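This is a follow-up to my previous article, a first-try speedrun at getting decent search out of OpenSearch (from hybrid search through to reranking).

Turns out a surprising number of people don't know what OpenSearch is or what it can do...!?

2.13 added a lot of features and I'd like to write a roundup of those too, but that's for another time.

This time: how to plug multilingual-e5-large or multilingual-e5-large-instruct, said to be the strongest openly available embedding models, into OpenSearch vector search. It was a bit tricky.

If you're curious about how well it performs, check the leaderboard.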
https://huggingface.co/spaces/mteb/leaderboard

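From using it, the gap compared with something like paraphrase-multilingual-MiniLM-L12-v2 is night and day. It eats more memory, but it runs without a GPU and I don't notice any slowness (probably; I'm not completely sure the GPU really isn't being used).

If you want to use something like text-embedding-3-large instead, look here.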
https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/index/

In short, following the embedding connector settings there,

POST /_plugins/_ml/connectors/_create
{
  "name": "<YOUR CONNECTOR NAME>",
  "description": "<YOUR CONNECTOR DESCRIPTION>",
  "version": "<YOUR CONNECTOR VERSION>",
  "protocol": "http",
  "parameters": {
    "model": "text-embedding-3-large"
  },
  "credential": {
    "openAI_key": "<PLEASE ADD YOUR OPENAI API KEY HERE>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://api.openai.com/v1/embeddings",
      "headers": {
        "Authorization": "Bearer ${credential.openAI_key}"
      },
      "request_body": "{ \"input\": ${parameters.input}, \"model\": \"${parameters.model}\" }",
      "pre_process_function": "connector.pre_process.openai.embedding",
      "post_process_function": "connector.post_process.openai.embedding"
    }
  ]
}

creates the connector, and then

POST /_plugins/_ml/models/_register
{
  "name": "text-embedding-3-large",
  "function_name": "remote",
  "model_group_id": "<your model group ID>",
  "description": "text-embedding-3-large",
  "connector_id": "<ID of the connector created above>"
}

registers the model. With remote models, deployment is automatic. Easy.

OK, that was a tangent.

Deploying an arbitrary model

The basics are documented here.
https://opensearch.org/docs/latest/ml-commons-plugin/custom-local-models/

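At the moment it looks like only TorchScript and ONNX models are supported.
But don't give up. There are plenty of ONNX models lying around.

Overview of the steps

Find an ONNX model
Zip it up
Get the file size and hash
Make it reachable at a URL somehow (I don't know how to do it without one)
Send the registration request

Figuring out even this much was painful; there were a lot of traps.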

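Find an ONNX model

Take a good look at the Hugging Face repository. A really good look.

Wait, a bot has already converted it to ONNX for us!?

So I'll gratefully use that.

Download everything in this folder. It's better to move the folder into WSL or similar; a Unix-like environment is probably required. You could probably automate this with a wget script (I haven't).

I ended up with something like this.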

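Zip it up

At first I zipped it on Windows, got invalid CEN header (bad compression method: 9), and wasted a huge amount of time... (reference)

I'm a Linux beginner so I don't really know what I'm doing, but go to the e5 directory from before and run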

zip ../multilingual-e5-large-instruct_v1.zip ./*.*

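or similar to zip it up. I don't know a nicer way to specify the folder contents...

The result is a zip with the ONNX model files you just downloaded sitting directly at the top level of the archive.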
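If you'd rather not depend on the zip command, the same flat archive can be built from Python. This is only a sketch: zip_flat is a name I made up, and it assumes every model file sits directly under the source directory. It forces plain Deflate (method 8), which as far as I can tell is what matters here: method 9 in the error above is Deflate64, which Java's zip reader does not handle.

```python
import zipfile
from pathlib import Path


def zip_flat(src_dir: str, zip_path: str) -> list[str]:
    """Zip every regular file directly under src_dir at the archive root.

    zip_path should live outside src_dir so the archive does not include itself.
    """
    src = Path(src_dir)
    names = []
    # ZIP_DEFLATED is compression method 8 (plain Deflate)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.iterdir()):
            if path.is_file():
                # arcname drops the directory part so files sit at the top level
                zf.write(path, arcname=path.name)
                names.append(path.name)
    return names
```

Something like zip_flat("e5", "multilingual-e5-large-instruct_v1.zip") run from the parent directory would mirror the command above, with the difference that it also picks up files that have no extension.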

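Get the file size and hash

Here is the template for the JSON to send later. The name and version look like they can be anything you choose. model_group_id is the model group created last time. Other ML settings are needed too, but I'm assuming those are already in place. In this JSON, put the file size in model_content_size_in_bytes and the hash in model_content_hash_value.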


{
	"name": "intfloat/multilingual-e5-large-instruct-v1",
	"version": "1.0.0",
	"model_group_id": "RzSd7I4BBmWZ2-waQ5by",
	"description": "This is a multilingual-e5-large-instruct model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.",
	"model_task_type": "TEXT_EMBEDDING",
	"model_format": "ONNX",
	"model_content_size_in_bytes": 1313606525,
	"model_content_hash_value": "a42fe70c1832bf16785f76a8cbdabd7d5c357517d906cd266cb6e33a37c2aaae",
	"model_config": {
		"pooling_mode": "mean",
		"normalize_result": "true",
		"model_type": "xlm-roberta",
		"embedding_dimension": 1024,
		"framework_type": "huggingface_transformers",
		"all_config": "{\"_name_or_path\": \"intfloat/multilingual-e5-large-instruct\", \"architectures\": [\"XLMRobertaModel\"], \"attention_probs_dropout_prob\": 0.1,\"bos_token_id\": 0, \"classifier_dropout\": null, \"eos_token_id\": 2, \"export_model_type\": \"transformer\",\"hidden_act\": \"gelu\",\"hidden_dropout_prob\": 0.1,\"hidden_size\": 1024,\"initializer_range\": 0.02, \"intermediate_size\": 4096,\"layer_norm_eps\": 1e-05, \"max_position_embeddings\": 514,\"model_type\": \"xlm-roberta\", \"num_attention_heads\": 16,\"num_hidden_layers\": 24, \"output_past\": true,\"pad_token_id\": 1,\"position_embedding_type\": \"absolute\", \"torch_dtype\": \"float16\", \"transformers_version\": \"4.39.3\", \"type_vocab_size\": 1, \"use_cache\": true, \"vocab_size\": 250002 }"
	},
	"created_time": 1676072210947,
	"url": "http://fastapi_for_model/get_file/model/multilingual-e5-large-instruct_v1.zip"
}

The file size shows up in ls -la or similar:

-rw-r--r-- 1 myu65 myu65 1313606525 Apr 25 23:43 multilingual-e5-large-instruct_v1.zip

I put in the number 1313606525 from there.

For the hash,

shasum -a 256 multilingual-e5-large-instruct_v1.zip

gives the value to put in.

Also, all_config is the content of the model's onnx_config.json with the " characters escaped.
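These three values can also be produced in one go from Python. A rough sketch, where size_and_sha256 and config_as_string are names I made up: the first returns the same numbers as ls -la and shasum -a 256, and the second turns a JSON config file into a one-line string.

```python
import hashlib
import json
from pathlib import Path


def size_and_sha256(path: str, chunk_size: int = 1024 * 1024) -> tuple[int, str]:
    """Return (size in bytes, SHA-256 hex digest), reading the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def config_as_string(config_path: str) -> str:
    """Load a JSON config file and re-serialize it as a one-line JSON string."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return json.dumps(config)
```

If the request body is built as a Python dict with all_config set to the string from config_as_string, a final json.dumps over the whole body escapes the inner quotes, so there is no need to add the backslashes by hand.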

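Make it reachable at a URL somehow (I don't know how to do it without one)

To get the model into OpenSearch, I want it to be reachable through an HTTP endpoint. I don't know the best way to do this.

It would be super easy if some kind soul published the zip and config on Hugging Face or somewhere...

This time... I just threw something together with FastAPI.

Modify docker-compose: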

docker-compose.yaml

version: '3'
services:
  opensearch: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    build: .
    # image: opensearchproject/opensearch:2.13.0 # Specifying the latest available image - modify if you want a specific version
    container_name: opensearch
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true # Disable JVM heap memory swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # Set min and max JVM heap sizes to at least 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}    # Sets the demo admin user password when using demo configuration, required for OpenSearch 2.12 and later
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.13.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch:9200"]' # Define the OpenSearch nodes that OpenSearch Dashboards will query
    networks:
      - opensearch-net

  # Added!
  fastapi_for_model:
    image: tiangolo/uvicorn-gunicorn-fastapi
    container_name: fastapi_for_model

    volumes:
      - ./model:/model
      - ./fastapi/app:/app

    ports:
      - 8000:80
    # This port can be closed; it's only open for local testing

    networks:
      - opensearch-net


volumes:
  opensearch-data1:

networks:
  opensearch-net:

Then put the following in fastapi/app/main.py:

fastapi/app/main.py

 
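from pathlib import Path

import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse

# Directory mounted from ./model in docker-compose.yaml
MODEL_DIR = Path("/model")

app = FastAPI()


@app.get("/")
def home():
    return {"message": "model download url App"}


@app.get("/get_file/model/{filename:path}")
async def get_file(filename: str):
    """Download a model file from /model."""
    file_path = (MODEL_DIR / filename).resolve()
    # Refuse paths that escape /model (e.g. "../x") or do not exist.
    # Path.is_relative_to needs Python 3.9+.
    if not file_path.is_relative_to(MODEL_DIR) or not file_path.is_file():
        raise HTTPException(status_code=404, detail="file not found")

    return FileResponse(path=file_path, filename=file_path.name)


if __name__ == "__main__":
    # This file is main.py, so the import string is "main:app"
    uvicorn.run("main:app", host="127.0.0.1", port=8000, reload=True)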

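That's roughly all it needs.
tiangolo/uvicorn-gunicorn-fastapi is the image published by FastAPI's author, and it serves the app in /app/main.py for you. For a simple API this is plenty.

It sits on the same network as OpenSearch, so OpenSearch can reach it with a plain HTTP request.

That is why the URL in the config we are going to send was "url":"http://fastapi_for_model/get_file/model/multilingual-e5-large-instruct_v1.zip".

The final file layout looks like this.

The e5 directory isn't needed anymore once the zip exists.

With this in place, starting everything with docker compose up should make the file downloadable.

Test with a small file first to confirm that downloads actually work.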
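One way to run that check from the host is to fetch the file and compare its size and hash with what goes into the registration JSON. A minimal sketch, assuming host port 8000 is still mapped as in the compose file; fetch_size_and_sha256 is a made-up helper name.

```python
import hashlib
import urllib.request


def fetch_size_and_sha256(url: str, chunk_size: int = 1024 * 1024) -> tuple[int, str]:
    """Stream the response body and return (size in bytes, SHA-256 hex digest)."""
    digest = hashlib.sha256()
    size = 0
    with urllib.request.urlopen(url) as resp:
        while chunk := resp.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()
```

For example, fetch_size_and_sha256("http://localhost:8000/get_file/model/multilingual-e5-large-instruct_v1.zip") should return the same pair as the local ls and shasum results; if it doesn't, the registration would most likely fail on the hash check anyway.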

Send the registration request

In the OpenSearch Dashboards console, send the JSON prepared earlier to POST /_plugins/_ml/models/_register.

POST /_plugins/_ml/models/_register
{
	"name": "intfloat/multilingual-e5-large-instruct-v1",
	"version": "1.0.0",
	"model_group_id": "RzSd7I4BBmWZ2-waQ5by",
	"description": "This is a multilingual-e5-large-instruct model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.",
	"model_task_type": "TEXT_EMBEDDING",
	"model_format": "ONNX",
	"model_content_size_in_bytes": 1333542970,
	"model_content_hash_value": "8ce76283739afbcbe88b41e976aaf8eff16839d5dfd2a073a9bd3550e02cdd8c",
	"model_config": {
		"pooling_mode": "mean",
		"normalize_result": "true",
		"model_type": "xlm-roberta",
		"embedding_dimension": 1024,
		"framework_type": "huggingface_transformers",
		"all_config": "{\"_name_or_path\": \"intfloat/multilingual-e5-large-instruct\", \"architectures\": [\"XLMRobertaModel\"], \"attention_probs_dropout_prob\": 0.1,\"bos_token_id\": 0, \"classifier_dropout\": null, \"eos_token_id\": 2, \"export_model_type\": \"transformer\",\"hidden_act\": \"gelu\",\"hidden_dropout_prob\": 0.1,\"hidden_size\": 1024,\"initializer_range\": 0.02, \"intermediate_size\": 4096,\"layer_norm_eps\": 1e-05, \"max_position_embeddings\": 514,\"model_type\": \"xlm-roberta\", \"num_attention_heads\": 16,\"num_hidden_layers\": 24, \"output_past\": true,\"pad_token_id\": 1,\"position_embedding_type\": \"absolute\", \"torch_dtype\": \"float16\", \"transformers_version\": \"4.39.3\", \"type_vocab_size\": 1, \"use_cache\": true, \"vocab_size\": 250002 }"
	},
	"created_time": 1676072210947,
	"url": "http://fastapi_for_model/get_file/model/multilingual-e5-large-instruct_v1.zip"
}

You get a task_id back, so check the task and wait until it completes. I failed a lot of times before getting this far.

GET /_plugins/_ml/tasks/<task ID>

{
  "model_id": "SAS3FY8Bu7RVyOi7jDQ2",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "nUQIhBClQ9STBFX6wPwKIA"
  ],
  "create_time": 1714056301571,
  "last_update_time": 1714056346307,
  "is_async": true
}

Once it shows as completed, deploy the model.

POST /_plugins/_ml/models/SAS3FY8Bu7RVyOi7jDQ2/_deploy

Done!
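If you script this instead of clicking through the console, the "check the task until it completes" part is just a polling loop. A sketch under a few assumptions: get_task is a hypothetical callable you supply that performs GET /_plugins/_ml/tasks/<task ID> and returns the parsed JSON, "COMPLETED" is the state shown above, and treating "FAILED" as the terminal error state is my assumption.

```python
import time
from typing import Callable


def wait_for_task(get_task: Callable[[str], dict], task_id: str,
                  interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll a task until it reports COMPLETED and return the final task document."""
    deadline = time.monotonic() + timeout
    while True:
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":  # assumed name of the failure state
            raise RuntimeError(f"task {task_id} failed: {task}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still in state {state!r}")
        time.sleep(interval)
```

The model_id in the returned document is what goes into the _deploy URL. Passing the HTTP call in as a function keeps the loop independent of whichever client and authentication setup you use.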

Checking that it works

From here, register documents in an index the same way as last time. First, create an ingest pipeline.

PUT /_ingest/pipeline/nlp-e5-ingest-pipeline
{
  "description": "A text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "SAS3FY8Bu7RVyOi7jDQ2",
        "field_map": {
          "text": "embedding"
        }
      }
    }
  ]
}

In field_map, text is the name of the field you want to vectorize and embedding is the name of the field where the vector gets stored.

Create the index:

PUT /my-nlp-e5
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-e5-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "engine": "lucene",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "text": {
        "type": "text",
        "analyzer": "kuromoji"
      }
    }
  }
}

This index uses nlp-e5-ingest-pipeline by default. The embedding field has dimension 1024 to match e5. The text field is analyzed as Japanese with kuromoji. So overall, a string put into text gets Japanese analysis, and its vector goes into embedding.

Now load some data:

POST _bulk
{"index": {"_index": "my-nlp-e5", "_id": "1"}}
{"text": "Qiitaで記事書くの久しぶりかも。個人的には雑な記事で許される雰囲気が好き。"}
{"index": {"_index": "my-nlp-e5", "_id": "2"}}
{"text": "テストデータです。"}
{"index": {"_index": "my-nlp-e5", "_id": "3"}}
{"text": "ねむいよ。"}
{"index": {"_index": "my-nlp-e5", "_id": "4"}}
{"text": "サンプルデータで検索引っかかりやすさの差が出るデータを作るのがめんどくさい。"}
{"index": {"_index": "my-nlp-e5", "_id": "5"}}
{"text": "本日は晴天なり"}
{"index": {"_index": "my-nlp-e5", "_id": "6"}}
{"text": "すっごくどうでもいいけど、「アイマリンプロジェクト 「Dive to Blue」 MMD MUSIC VIDEO」がマイブームです。https:\/\/www.youtube.com/watch?v=XCzGs6rLf4s"}

The data is in.
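For more than a handful of documents, writing the _bulk body by hand gets old quickly. Here is a small sketch that generates the same action-line/document-line pairs; build_bulk_body is a made-up name, and the sequential string IDs just mirror the sample above.

```python
import json


def build_bulk_body(index: str, texts: list[str], start_id: int = 1) -> str:
    """Build an NDJSON _bulk body: one action line plus one document line per text."""
    lines = []
    for offset, text in enumerate(texts):
        action = {"index": {"_index": index, "_id": str(start_id + offset)}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps Japanese text readable instead of \uXXXX escapes
        lines.append(json.dumps({"text": text}, ensure_ascii=False))
    # The bulk API expects the body to end with a newline
    return "\n".join(lines) + "\n"
```

The returned string is what you would POST to _bulk, encoded as UTF-8.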

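Last time I added reranking, but the cross encoder performed so poorly that I'm dropping it.

I'll use the same hybrid search as last time, weighted keyword 3 : vector 7 (this PUT was already done last time, so it doesn't need to be run again).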

PUT /_search/pipeline/nlp-search-pipeline
{
 "description": "Post processor for hybrid search",
 "phase_results_processors": [
   {
     "normalization-processor": {
       "normalization": {
         "technique": "min_max"
       },
       "combination": {
         "technique": "arithmetic_mean",
         "parameters": {
           "weights": [
             0.3,
             0.7
           ]
         }
       }
     }
   }
 ]
}
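To get a feel for what min_max plus a weighted arithmetic_mean does to the two result lists, here is a simplified illustration in Python. It is not OpenSearch's actual code: the function names are mine, it assumes a document missing from one sub-query contributes 0 there, and it ignores any implementation details such as score floors.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one sub-query's scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A single hit (or all-equal scores) is treated as the top score
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(results: list[dict[str, float]], weights: list[float]) -> dict[str, float]:
    """Weighted arithmetic mean of the normalized scores, per document."""
    normalized = [min_max(r) for r in results]
    docs = set().union(*normalized)
    total = sum(weights)
    return {
        doc: sum(w * n.get(doc, 0.0) for w, n in zip(weights, normalized)) / total
        for doc in docs
    }
```

Called as combine([keyword_scores, vector_scores], [0.3, 0.7]), a document that tops both lists ends up at the maximum combined score, while one that only appears low in the vector list can get at most a fraction of the 0.7 share.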

OK, search!

POST my-nlp-e5/_search?search_pipeline=nlp-search-pipeline
{ 
  "_source": {
    "exclude": [
      "embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "アイマリンプロジェクト"
            }
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "アイマリンプロジェクト",
              "model_id": "SAS3FY8Bu7RVyOi7jDQ2",
              "k": 5
            }
          }
        }
      ]
    }
  }
}

Results!

{
  "took": 61,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-nlp-e5",
        "_id": "6",
        "_score": 1,
        "_source": {
          "text": "すっごくどうでもいいけど、「アイマリンプロジェクト 「Dive to Blue」 MMD MUSIC VIDEO」がマイブームです。https://www.youtube.com/watch?v=XCzGs6rLf4s"
        }
      },
      {
        "_index": "my-nlp-e5",
        "_id": "3",
        "_score": 0.5152309,
        "_source": {
          "text": "ねむいよ。"
        }
      },
      {
        "_index": "my-nlp-e5",
        "_id": "5",
        "_score": 0.3490331,
        "_source": {
          "text": "本日は晴天なり"
        }
      },
      {
        "_index": "my-nlp-e5",
        "_id": "2",
        "_score": 0.33217168,
        "_source": {
          "text": "テストデータです。"
        }
      },
      {
        "_index": "my-nlp-e5",
        "_id": "1",
        "_score": 0.00070000003,
        "_source": {
          "text": "Qiitaで記事書くの久しぶりかも。個人的には雑な記事で許される雰囲気が好き。"
        }
      }
    ]
  }
}

Here is a table comparing this with last time's results.
Hybrid search term: アイマリンプロジェクト
Weights, keyword : vector = 3 : 7

	Last time	This time
Model	sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2	intfloat/multilingual-e5-large-instruct
1st (score, text)	(0.7, 本日は晴天なり)	(1, すっごくどうでもいいけど (rest omitted))
2nd (score, text)	(0.53, ねむいよ。)	(0.51, ねむいよ。)
3rd (score, text)	(0.39, すっごくどうでもいいけど (rest omitted))	(0.35, 本日は晴天なり)

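There. The document that actually contains the search term now ranks first instead of third, so retrieval is clearly far better, at least on this query.

Nice.

I think the same approach should work on AWS too, as long as you can sort out the endpoint somehow.

Also, adding the instruct prefix probably improves performance... I haven't built it into the pipeline, but it would be worth doing.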
https://huggingface.co/intfloat/multilingual-e5-large-instruct#faq

POST my-nlp-e5/_search?search_pipeline=nlp-search-pipeline
{ 
  "_source": {
    "exclude": [
      "embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "踊りが好きです。"
            }
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "Instruct: {Given a web search query, retrieve relevant passages that answer the query}\nQuery: {踊りが好きです。}",
              "model_id": "SAS3FY8Bu7RVyOi7jDQ2",
              "k": 5
            }
          }
        }
      ]
    }
  }
}
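One caveat about the query above: in the model card's template, f'Instruct: {task_description}\nQuery: {query}', the braces are Python f-string placeholders rather than literal characters, so the prompt the model was trained on has no braces in it. The numbers in this article were produced with the literal braces left in. A sketch of building the string as the model card describes (get_detailed_instruct follows the model card; neural_clause is a hypothetical helper of mine):

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Format a query the way the multilingual-e5-large-instruct model card shows."""
    return f"Instruct: {task_description}\nQuery: {query}"


def neural_clause(query: str, model_id: str, k: int = 5) -> dict:
    """Build the neural part of the hybrid query with the instruct prefix applied."""
    task = "Given a web search query, retrieve relevant passages that answer the query"
    return {
        "neural": {
            "embedding": {
                "query_text": get_detailed_instruct(task, query),
                "model_id": model_id,
                "k": k,
            }
        }
    }
```

Only the query side gets the prefix here; the documents indexed through the ingest pipeline stay as plain text, which matches what the model card FAQ linked above describes.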

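"With template" is the query above. "Without template" is just 「踊りが好きです。」 on its own.

	Without template	With template
1st (score, text)	(0.7, ねむいよ。)	(0.7, すっごくどうでもいいけど (rest omitted))
2nd (score, text)	(0.40, Qiitaで記事 (rest omitted))	(0.61, ねむいよ。)
3rd (score, text)	(0.12, テストデータです。)	(0.51, Qiitaで記事 (rest omitted))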
