
Comparing Bedrock Embeddings Models on Japanese Datasets (Titan Embeddings G1/V2, Cohere Embed)

Posted on 2024-05-13

Titan Text Embeddings V2 has reached GA, so I'd like to compare the performance of the Embeddings models that can handle Japanese.

Datasets used

JSTS

JSICK

Note: are there any other good datasets? I'm also not sure what the right way to use multilingual ones is.

Comparison programs

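For each row of the Japanese dataset, the script vectorizes Japanese sentence 1 and Japanese sentence 2 and computes their cosine similarity. It then computes the Spearman rank correlation between those similarities and the dataset's gold scores.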
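To make the metric concrete, here is a minimal sketch of the same two steps (cosine similarity per pair, then Spearman rank correlation against the gold scores) written with the standard library only. The toy vectors and gold scores are made up for illustration, and the helper names `cosine`, `average_ranks`, and `spearman` are my own; the scripts in this post use `sklearn` and `scipy` instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical toy data: (embedding of sentence 1, embedding of sentence 2)
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),   # nearly the same direction
    ([1.0, 0.0], [1.0, 1.0]),   # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal
    ([1.0, 0.0], [-1.0, 0.2]),  # nearly opposite
]
gold = [4.8, 3.0, 1.5, 0.2]  # made-up human similarity scores

sims = [cosine(u, v) for u, v in pairs]
rho = spearman(sims, gold)
print(rho)
```

Because the toy similarities fall in exactly the same order as the gold scores, `rho` comes out at 1 up to floating-point rounding. Only the ordering matters: a model does not need to reproduce the gold scores themselves to get a high value.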

JSTS.py
from langchain_community.embeddings import BedrockEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import json
from scipy.stats import spearmanr

# Create each Bedrock embeddings client once, outside the loop
embeddings_1 = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")  # Cohere Embed Multilingual
embeddings_2 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")  # Titan Embeddings G1 - Text
embeddings_3 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")  # Titan Text Embeddings V2

cosine_ary_1 = []
cosine_ary_2 = []
cosine_ary_3 = []
label_ary = []
# The file is read as JSON Lines: one JSON object per line
with open("valid-v1.1.json", encoding="utf-8") as f:
    for line in f:
        line = json.loads(line)

        # Cohere Embed Multilingual
        vector_1_1 = embeddings_1.embed_documents([line["sentence1"]])
        vector_1_2 = embeddings_1.embed_documents([line["sentence2"]])
        similarities_1 = cosine_similarity(vector_1_1, vector_1_2).tolist()[0][0]
        cosine_ary_1.append(similarities_1)

        # Titan Embeddings G1 - Text
        vector_2_1 = embeddings_2.embed_documents([line["sentence1"]])
        vector_2_2 = embeddings_2.embed_documents([line["sentence2"]])
        similarities_2 = cosine_similarity(vector_2_1, vector_2_2).tolist()[0][0]
        cosine_ary_2.append(similarities_2)

        # Titan Text Embeddings V2
        vector_3_1 = embeddings_3.embed_documents([line["sentence1"]])
        vector_3_2 = embeddings_3.embed_documents([line["sentence2"]])
        similarities_3 = cosine_similarity(vector_3_1, vector_3_2).tolist()[0][0]
        cosine_ary_3.append(similarities_3)

        # Gold score
        label = line["label"]
        label_ary.append(label)

# Rank correlation with the gold scores

# Embed Multilingual
correlation, pvalue = spearmanr(cosine_ary_1,label_ary)
print("Cohere Embed: ")
print(correlation)

# Titan Embeddings G1 - Text
correlation, pvalue = spearmanr(cosine_ary_2,label_ary)
print("Amazon Titan Embeddings G1 - Text: ")
print(correlation)

# Titan Text Embeddings V2
correlation, pvalue = spearmanr(cosine_ary_3,label_ary)
print("Amazon Titan Text Embeddings V2: ")
print(correlation)
JSICK.py
from langchain_community.embeddings import BedrockEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from scipy.stats import spearmanr

# Create each Bedrock embeddings client once, outside the loop
embeddings_1 = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")  # Cohere Embed Multilingual
embeddings_2 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")  # Titan Embeddings G1 - Text
embeddings_3 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")  # Titan Text Embeddings V2

cosine_ary_1 = []
cosine_ary_2 = []
cosine_ary_3 = []
score_ary = []

df_tsv = pd.read_csv('test.tsv', sep='\t')
for _, line in df_tsv.iterrows():

    # Cohere Embed Multilingual
    vector_1_1 = embeddings_1.embed_documents([line["sentence_A_Ja"]])
    vector_1_2 = embeddings_1.embed_documents([line["sentence_B_Ja"]])
    similarities_1 = cosine_similarity(vector_1_1, vector_1_2).tolist()[0][0]
    cosine_ary_1.append(similarities_1)

    # Titan Embeddings G1 - Text
    vector_2_1 = embeddings_2.embed_documents([line["sentence_A_Ja"]])
    vector_2_2 = embeddings_2.embed_documents([line["sentence_B_Ja"]])
    similarities_2 = cosine_similarity(vector_2_1, vector_2_2).tolist()[0][0]
    cosine_ary_2.append(similarities_2)

    # Titan Text Embeddings V2
    vector_3_1 = embeddings_3.embed_documents([line["sentence_A_Ja"]])
    vector_3_2 = embeddings_3.embed_documents([line["sentence_B_Ja"]])
    similarities_3 = cosine_similarity(vector_3_1, vector_3_2).tolist()[0][0]
    cosine_ary_3.append(similarities_3)

    # Gold score
    score = line["relatedness_score_Ja"]
    score_ary.append(score)

# Rank correlation with the gold scores

# Embed Multilingual
correlation, pvalue = spearmanr(cosine_ary_1,score_ary)
print("Cohere Embed: ")
print(correlation)

# Titan Embeddings G1 - Text
correlation, pvalue = spearmanr(cosine_ary_2,score_ary)
print("Amazon Titan Embeddings G1 - Text: ")
print(correlation)

# Titan Text Embeddings V2
correlation, pvalue = spearmanr(cosine_ary_3,score_ary)
print("Amazon Titan Text Embeddings V2: ")
print(correlation)

Conclusion

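Spearman rank correlation with the gold scores (higher is better):

| Model | JSTS | JSICK |
| --- | --- | --- |
| Cohere Embed | 0.8144 | 0.7726 |
| Amazon Titan Embeddings G1 - Text | 0.7293 | 0.7706 |
| Amazon Titan Text Embeddings V2 | 0.7805 | 0.7680 |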

According to JSTS, Titan Embeddings V2 is a solid step forward from G1 (0.7805 vs. 0.7293):

Cohere Embed > Titan Embeddings V2 > Titan Embeddings G1

That is the ranking I wanted to report, but on JSICK V2 actually scores slightly lower than G1:

Cohere Embed > Titan Embeddings G1 > Titan Embeddings V2

The gap on JSICK is less than 0.003, though, so it is probably just noise.
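One way to check whether a gap this small is noise is a paired bootstrap: resample the sentence pairs with replacement, recompute both models' Spearman correlations on each resample, and look at the spread of the difference. The sketch below is an illustration only. The data is synthetic (two equally noisy "models" generated from random gold scores), the function name `bootstrap_diff_ci` and its parameters are my own, and none of the numbers it prints come from the measurements in this post.

```python
import math
import random

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho; returns NaN if either input has no rank variation."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        return float("nan")
    return cov / (sx * sy)

def bootstrap_diff_ci(sims_a, sims_b, gold, n_resamples=1000, seed=0):
    """Approximate 95% percentile interval for rho(A) - rho(B).

    The same resampled rows are used for both models (paired bootstrap).
    """
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        rho_a = spearman([sims_a[i] for i in idx], g)
        rho_b = spearman([sims_b[i] for i in idx], g)
        if not (math.isnan(rho_a) or math.isnan(rho_b)):
            diffs.append(rho_a - rho_b)
    diffs.sort()
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[int(0.975 * len(diffs)) - 1]
    return lo, hi

# Synthetic stand-in data (NOT the JSICK results): two equally noisy models
data_rng = random.Random(42)
gold = [data_rng.uniform(0.0, 5.0) for _ in range(200)]
sims_a = [g / 5.0 + data_rng.gauss(0.0, 0.15) for g in gold]
sims_b = [g / 5.0 + data_rng.gauss(0.0, 0.15) for g in gold]

lo, hi = bootstrap_diff_ci(sims_a, sims_b, gold)
print("point estimate:", spearman(sims_a, gold) - spearman(sims_b, gold))
print("approx. 95% interval for the difference:", (lo, hi))
```

To apply this to the real run, pass two of the similarity lists from JSICK.py (for example `cosine_ary_2` and `cosine_ary_3`) together with `score_ary`. If the resulting interval contains 0, the data does not clearly separate the two models.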

Titan Embeddings accepts a large number of input tokens, which allows a large chunk size, so I'd like to try Titan Embeddings V2 when handling Japanese in a knowledge base.

Execution results

JSTS

Cohere Embed:
0.8144365311349437

Amazon Titan Embeddings G1 - Text:
0.7293064575666538

Amazon Titan Text Embeddings V2:
0.7805891835451588

JSICK

Cohere Embed:
0.7726640670055452

Amazon Titan Embeddings G1 - Text:
0.7706490119268313

Amazon Titan Text Embeddings V2:
0.7680714674118668
