Similarity search over ChatGPT Embeddings with Elasticsearch/OpenSearch

Posted at 2024-08-02, last updated at 2024-08-13

Introduction

Last time, I computed embedding similarities using Amazon MemoryDB.

Elasticsearch/OpenSearch apparently supports vector search as well, so this time I looked into that.

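Setting up OpenSearch

I will skip the server setup itself.
It can be created with a few button clicks, so please refer to other resources for that.
For this article I am borrowing a server that already runs in our in-house development environment.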

Creating an index

From the official Elasticsearch documentation:

// Create the index
PUT vector-search-test
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine"
      },
      "my_text" : {
        "type" : "text"
      }
    }
  }
}

Let's send this to OpenSearch.
Hmm... it does not work. OpenSearch does not recognize the dense_vector field type.

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [dense_vector] declared on field [my_vector]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping [_doc]: No handler for type [dense_vector] declared on field [my_vector]",
    "caused_by": {
      "type": "mapper_parsing_exception",
      "reason": "No handler for type [dense_vector] declared on field [my_vector]"
    }
  },
  "status": 400
}

So let's look at the AWS documentation for OpenSearch instead.

I send the request as described in that documentation.
The dimension is set to 1536 to match the OpenAI embedding.

// Create the index
PUT vector-search-test
{
   "settings": {
      "index.knn": true
   },
   "mappings": {
      "properties": {
         "my-vector": {
            "type": "knn_vector",
            "dimension": 1536
         }
      }
   }
}

The index was created successfully.

// Result
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "vector-search-test"
}
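The mapping above does not name a distance function, so the k-NN field falls back to the plugin's default space (l2, as far as I know). As a minimal sketch, here is how the same request body could be built in Python, optionally with an explicit space type. The helper name build_knn_index_body is my own, and the method block (hnsw with cosinesimil) is an assumption rather than what this article used; which engines accept which space types depends on the OpenSearch version.

```python
import json


def build_knn_index_body(field, dimension, space_type=None):
    """Build the JSON body for `PUT <index>` with one knn_vector field.

    Hypothetical helper for illustration; it only assembles a dict and
    does not talk to OpenSearch.
    """
    field_mapping = {"type": "knn_vector", "dimension": dimension}
    if space_type is not None:
        # Assumption: an explicit HNSW method definition. Engine support for a
        # given space type varies between OpenSearch versions.
        field_mapping["method"] = {"name": "hnsw", "space_type": space_type}
    return {
        "settings": {"index.knn": True},
        "mappings": {"properties": {field: field_mapping}},
    }


# Same body as the request in the article (default space type).
default_body = build_knn_index_body("my-vector", 1536)

# Variant that asks for cosine similarity explicitly.
cosine_body = build_knn_index_body("my-vector", 1536, space_type="cosinesimil")
print(json.dumps(cosine_body, indent=2))
```

Leaving space_type as None reproduces the request used in this article exactly.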

Storing vector data

Again following the AWS documentation, I insert ChatGPT Embeddings data.

POST vector-search-test/_doc
{
  "my-vector": [
    0.024686571,
    -0.024063775,
   ....
    -0.0026369444,
    0.006864006
  ]
}

Good, the document was stored without problems.

{
  "_index": "vector-search-test",
  "_id": "Z-B9DJEBYxeOUfNsDKpU",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
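Because the mapping fixes the dimension at 1536, a vector of any other length is rejected at index time. A small client-side check before sending can catch that earlier. The sketch below is only an illustration: validate_embedding is a hypothetical helper, and the field name my-vector is taken from the mapping above.

```python
import math

EXPECTED_DIM = 1536  # dimension declared in the index mapping


def validate_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return a document body for the vector, or raise ValueError.

    Hypothetical helper: checks the length and that every value is a
    finite number before the vector is sent to OpenSearch.
    """
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    for position, value in enumerate(vector):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError(f"value at position {position} is not a finite number")
    return {"my-vector": [float(value) for value in vector]}


# Placeholder vector, not a real embedding.
document = validate_embedding([0.0] * EXPECTED_DIM)
print(len(document["my-vector"]))
```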

Vector similarity search

Next, let's run a search.

GET vector-search-test/_search
{
  "size": 5,
  "query": {
    "knn": {
      "my-vector": {
        "vector": [
          0,
          0,
          0,
          ...
          0,
          0,
          0
        ],
        "k": 5
      }
    }
  }
}

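The document was retrieved successfully.
"_score": 0.50000006
Note that this value is not a cosine similarity. The mapping above does not set a space type, so the k-NN field uses the default l2 space, whose score is 1 / (1 + squared L2 distance). The query here is the all-zero vector and OpenAI embeddings have length 1, so the squared distance is 1 and the score comes out as 1 / (1 + 1) = 0.5.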

{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.50000006,
    "hits": [
      {
        "_index": "vector-search-test",
        "_id": "Z-B9DJEBYxeOUfNsDKpU",
        "_score": 0.50000006,
        "_source": {
          "my-vector": [
            0.024686571,
            -0.024063775,
            ...,
            -0.0026369444,
            0.006864006
          ]
        }
      }
    ]
  }
}
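To make the 0.5 in the result above concrete, here is a small sketch of the l2 scoring rule, score = 1 / (1 + squared L2 distance), applied to a toy two-dimensional unit vector instead of a 1536-dimensional embedding. The function names are my own, and the sketch assumes the stored embedding has unit length, as OpenAI embeddings are documented to be.

```python
import math


def l2_score(query, doc):
    """Score under the l2 space: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - d) ** 2 for q, d in zip(query, doc))
    return 1.0 / (1.0 + squared_distance)


def cosine_similarity(a, b):
    """Plain cosine similarity; undefined when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


unit_doc = [0.6, 0.8]     # toy stand-in for a normalized embedding
zero_query = [0.0, 0.0]   # same kind of query as in the article

# The squared distance from the origin to a unit vector is 1,
# so the score is approximately 1 / (1 + 1) = 0.5.
print(l2_score(zero_query, unit_doc))

# For two unit vectors the squared distance equals 2 - 2 * cosine,
# so the l2 ranking agrees with the cosine ranking.
other_unit = [1.0, 0.0]
print(l2_score(other_unit, unit_doc), cosine_similarity(other_unit, unit_doc))
```

With normalized embeddings the order of the hits is therefore the same as with cosine similarity, even though the score values differ.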

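Loading a large amount of data

Next, I load a larger amount of data.
I had ChatGPT write sample code for OpenSearch.

Prompt:
Show me sample code for accessing OpenSearch from PHP

ChatGPT's answer:
Here is sample code for accessing OpenSearch from PHP. OpenSearch is an open-source fork of Elasticsearch and offers an API similar to Elasticsearch's. The sample below shows how to add data to an OpenSearch cluster and search it from PHP.

Installing the required package
First, install Guzzle, an HTTP client library for PHP. Use Composer to install it.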

composer require guzzlehttp/guzzle

Sample code

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

class OpenSearchClient
{
    private $client;
    private $index;
    private $type;

    public function __construct($host, $index, $type)
    {
        $this->client = new Client(['base_uri' => $host]);
        $this->index = $index;
        $this->type = $type;
    }

    public function indexDocument($id, $document)
    {
        try {
            $response = $this->client->request('PUT', "/{$this->index}/{$this->type}/$id", [
                'json' => $document
            ]);
            return json_decode($response->getBody(), true);
        } catch (RequestException $e) {
            if ($e->hasResponse()) {
                return json_decode($e->getResponse()->getBody(), true);
            }
            return ['error' => $e->getMessage()];
        }
    }

    public function search($query)
    {
        try {
            $response = $this->client->request('GET', "/{$this->index}/_search", [
                'json' => $query
            ]);
            return json_decode($response->getBody(), true);
        } catch (RequestException $e) {
            if ($e->hasResponse()) {
                return json_decode($e->getResponse()->getBody(), true);
            }
            return ['error' => $e->getMessage()];
        }
    }
}

$host = 'http://localhost:9200';
$index = 'my_index';
$type = '_doc';

$client = new OpenSearchClient($host, $index, $type);

// Add a document
$id = 1;
$document = [
    'title' => 'Sample Document',
    'content' => 'This is a sample document for OpenSearch.'
];
$response = $client->indexDocument($id, $document);
print_r($response);

// Search for documents
$query = [
    'query' => [
        'match' => [
            'content' => 'sample'
        ]
    ]
];
$response = $client->search($query);
print_r($response);

This sample code does the following.

OpenSearchClient class: a client class that handles requests to the OpenSearch cluster.
indexDocument method: adds a document to the specified index.
search method: searches documents with the given query.
Guzzle HTTP client: used to send HTTP requests to the OpenSearch cluster.

To run this code, an OpenSearch cluster must be running. For a local environment, you can set up OpenSearch with Docker.

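Nice, that is quite a thorough answer.
Using this code, I load 10,000 embeddings that I had on hand.

They were loaded successfully.

In terms of file size, it came to about 250 MB.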
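The generated sample sends one HTTP request per document. For 10,000 vectors, the _bulk endpoint is a common alternative because it accepts many documents in one newline-delimited JSON body. The sketch below only builds such payloads in Python and does not send them. It is not how the data in this article was loaded; build_bulk_payload and chunked are hypothetical helper names, and the batch size of 500 is an arbitrary assumption.

```python
import json


def build_bulk_payload(index, vectors, field="my-vector"):
    """Build an NDJSON body for POST /_bulk.

    Each document takes two lines: an action line and the document itself.
    The body must end with a newline.
    """
    lines = []
    for vector in vectors:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({field: vector}))
    return "\n".join(lines) + "\n"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder vectors (3 dimensions) just to show the shape of the payload.
sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
for batch in chunked(sample_vectors, 500):  # 500 per request is an assumption
    payload = build_bulk_payload("vector-search-test", batch)
    print(payload)
```

Each payload would be sent with an HTTP client of your choice as the body of POST /_bulk, with a JSON content type.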

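Benchmark

I measured the search time five times and took the average.

OpenSearch processing time: 0.15063285827637 sec
OpenSearch processing time: 0.1567530632019 sec
OpenSearch processing time: 0.16443109512329 sec
OpenSearch processing time: 0.15295004844666 sec
OpenSearch processing time: 0.15922808647156 sec
OpenSearch average time: 0.1568 sec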
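The average above can be reproduced from the five measurements with a few lines of Python. This is only a check of the arithmetic using the values listed in this article; the variable names are my own.

```python
from statistics import mean

# The five processing times reported above, in seconds.
timings_sec = [
    0.15063285827637,
    0.1567530632019,
    0.16443109512329,
    0.15295004844666,
    0.15922808647156,
]

average_sec = mean(timings_sec)
print(f"{average_sec:.4f} sec")        # 0.1568 sec
print(f"{average_sec * 1000:.1f} ms")  # 156.8 ms
```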

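Speed comparison

Here is the time taken by each calculation method, plotted as a graph.

Oh, the MemoryDB bar is so small that it is no longer visible...
Let's switch the unit to ms and use a logarithmic scale.
That gives a much more readable graph.
As expected, MemoryDB (RediSearch) is fast.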

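Storage comparison

Roughly, it looks like this.
The sizes are compared without any redundancy configured.
OpenSearch uses slightly more space.

Cost comparison

Various other constraints apply, so take care, but roughly the prices are as follows.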

InstanceType	Amazon MemoryDB	Amazon OpenSearch	Amazon RDS
*.r#g.large	USD 0.371	USD 0.202	USD 0.313
*.r#g.16xlarge	USD 11.811	USD 6.46	USD 10.024
*.t#g.small	USD 0.074	USD 0.056	USD 0.057
*.t#g.medium	USD 0.147	USD 0.112	USD 0.113
Storage price	Depends on memory capacity	USD 0.1464	USD 0.12

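Some of the figures are estimates. (Note that some of these plans do not exist.)

Conclusion

If you are already using OpenSearch, putting your vectors into OpenSearch is the quickest way to get started.
Beyond that, choose based on the capacity you need, the speed you want, and your budget.
MySQL 9 and MariaDB Vector (11.6) are also said to support vector search, so I hope to include them in a future comparison.

Appendix

If you get the following error:

cURL error 60: SSL: no alternative certificate subject name matches target host name 'localhost'
here is how to get past it.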

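When you access OpenSearch through something like port forwarding, the certificate does not match 'localhost', so skip certificate verification as follows.

$this->client = new Client([
    'base_uri' => $host,
    'verify' => false, // Skip SSL certificate verification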
]);
