Amazon Titan Text Embeddings V2 has reached GA, so let's compare the performance of embeddings models on Bedrock that can handle Japanese.
## Datasets

- JSTS
- JSICK

(Are there any other good options? I'm also not sure what the correct way to use the multilingual benchmarks is.)
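For reference, JSTS is distributed as JSON Lines (one object per line with `sentence1`, `sentence2`, and a gold `label`), while JSICK is a TSV with `sentence_A_Ja`, `sentence_B_Ja`, and `relatedness_score_Ja` columns. A minimal parsing sketch; the records below are made up for illustration:

```python
import json

# A toy JSTS-style record (JSON Lines: one JSON object per line)
jsts_line = '{"sentence1": "犬が走る。", "sentence2": "犬が駆ける。", "label": 4.6}'
record = json.loads(jsts_line)
print(record["sentence1"], record["sentence2"], record["label"])

# A toy JSICK-style TSV row, split on tabs
jsick_header = "sentence_A_Ja\tsentence_B_Ja\trelatedness_score_Ja"
jsick_row = "男性が歩いている。\t男性が歩く。\t4.8"
row = dict(zip(jsick_header.split("\t"), jsick_row.split("\t")))
print(row["relatedness_score_Ja"])
```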
## Comparison program

For each row of the dataset, the program embeds Japanese sentence 1 and sentence 2, computes their cosine similarity, and then computes the Spearman rank correlation between those similarities and the dataset's gold scores.
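The core metric — cosine similarity per sentence pair, then Spearman rank correlation against the gold scores — can be checked on toy data first. The vectors and gold scores below are made up:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    # Cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Three made-up sentence pairs as toy embedding vectors
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.1, 0.0]),  # nearly identical
    ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]),  # somewhat related
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal, unrelated
]
gold_scores = [5.0, 3.0, 0.0]  # made-up gold relatedness

cosines = [cosine(np.array(a), np.array(b)) for a, b in pairs]

# Spearman looks only at rankings; here the two orderings match perfectly
correlation, pvalue = spearmanr(cosines, gold_scores)
print(correlation)  # → 1.0
```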
For JSTS (`valid-v1.1.json`, JSON Lines):

```python
from langchain_community.embeddings import BedrockEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import json
from scipy.stats import spearmanr

# Instantiate each model once, outside the loop
embeddings_cohere = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")
embeddings_titan_g1 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
embeddings_titan_v2 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

cosine_ary_1 = []
cosine_ary_2 = []
cosine_ary_3 = []
label_ary = []

with open("valid-v1.1.json", encoding="utf-8") as f:
    for line in f:
        line = json.loads(line)

        # Cohere Embed Multilingual
        vector_1_1 = embeddings_cohere.embed_documents([line["sentence1"]])
        vector_1_2 = embeddings_cohere.embed_documents([line["sentence2"]])
        cosine_ary_1.append(cosine_similarity(vector_1_1, vector_1_2)[0][0])

        # Titan Embeddings G1 - Text
        vector_2_1 = embeddings_titan_g1.embed_documents([line["sentence1"]])
        vector_2_2 = embeddings_titan_g1.embed_documents([line["sentence2"]])
        cosine_ary_2.append(cosine_similarity(vector_2_1, vector_2_2)[0][0])

        # Titan Text Embeddings V2
        vector_3_1 = embeddings_titan_v2.embed_documents([line["sentence1"]])
        vector_3_2 = embeddings_titan_v2.embed_documents([line["sentence2"]])
        cosine_ary_3.append(cosine_similarity(vector_3_1, vector_3_2)[0][0])

        # Gold label
        label_ary.append(line["label"])

# Spearman rank correlation against the gold labels
correlation, pvalue = spearmanr(cosine_ary_1, label_ary)
print("Cohere Embed: ")
print(correlation)

correlation, pvalue = spearmanr(cosine_ary_2, label_ary)
print("Amazon Titan Embeddings G1 - Text: ")
print(correlation)

correlation, pvalue = spearmanr(cosine_ary_3, label_ary)
print("Amazon Titan Text Embeddings V2: ")
print(correlation)
```
For JSICK (`test.tsv`):

```python
from langchain_community.embeddings import BedrockEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from scipy.stats import spearmanr

# Instantiate each model once, outside the loop
embeddings_cohere = BedrockEmbeddings(model_id="cohere.embed-multilingual-v3")
embeddings_titan_g1 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
embeddings_titan_v2 = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

cosine_ary_1 = []
cosine_ary_2 = []
cosine_ary_3 = []
score_ary = []

df_tsv = pd.read_csv("test.tsv", sep="\t")
for _, line in df_tsv.iterrows():
    # Cohere Embed Multilingual
    vector_1_1 = embeddings_cohere.embed_documents([line["sentence_A_Ja"]])
    vector_1_2 = embeddings_cohere.embed_documents([line["sentence_B_Ja"]])
    cosine_ary_1.append(cosine_similarity(vector_1_1, vector_1_2)[0][0])

    # Titan Embeddings G1 - Text
    vector_2_1 = embeddings_titan_g1.embed_documents([line["sentence_A_Ja"]])
    vector_2_2 = embeddings_titan_g1.embed_documents([line["sentence_B_Ja"]])
    cosine_ary_2.append(cosine_similarity(vector_2_1, vector_2_2)[0][0])

    # Titan Text Embeddings V2
    vector_3_1 = embeddings_titan_v2.embed_documents([line["sentence_A_Ja"]])
    vector_3_2 = embeddings_titan_v2.embed_documents([line["sentence_B_Ja"]])
    cosine_ary_3.append(cosine_similarity(vector_3_1, vector_3_2)[0][0])

    # Gold relatedness score
    score_ary.append(line["relatedness_score_Ja"])

# Spearman rank correlation against the gold scores
correlation, pvalue = spearmanr(cosine_ary_1, score_ary)
print("Cohere Embed: ")
print(correlation)

correlation, pvalue = spearmanr(cosine_ary_2, score_ary)
print("Amazon Titan Embeddings G1 - Text: ")
print(correlation)

correlation, pvalue = spearmanr(cosine_ary_3, score_ary)
print("Amazon Titan Text Embeddings V2: ")
print(correlation)
```
## Conclusion

| Model | JSTS | JSICK |
|---|---|---|
| Cohere Embed | 0.8144 | 0.7726 |
| Amazon Titan Embeddings G1 - Text | 0.7293 | 0.7706 |
| Amazon Titan Text Embeddings V2 | 0.7805 | 0.7680 |
Going by JSTS, Titan Text Embeddings V2 is making steady progress:

Cohere Embed > Titan Embeddings V2 > Titan Embeddings G1

...is what I wanted to say, but on JSICK it actually loses to G1:

Cohere Embed > Titan Embeddings G1 > Titan Embeddings V2

The gap is small enough that it may just be noise.
The Titan Embeddings models accept more input tokens, which allows larger chunk sizes, so when handling Japanese in Knowledge Bases for Amazon Bedrock I'd like to try Titan Text Embeddings V2.
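One side note on Titan Text Embeddings V2: the model lets you choose the output dimension and whether the returned vector is normalized (request parameters `dimensions` and `normalize`, per the Bedrock documentation). Normalization should not affect the results measured here, because cosine similarity is invariant to rescaling either vector. A quick check with made-up vectors:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.normal(size=1024)  # stand-ins for raw embedding vectors
b = rng.normal(size=1024)

# Normalizing both vectors to unit length leaves the similarity unchanged
raw = cosine(a, b)
normalized = cosine(a / np.linalg.norm(a), b / np.linalg.norm(b))
print(raw, normalized)
```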
## Raw results

### JSTS

```
Cohere Embed: 
0.8144365311349437
Amazon Titan Embeddings G1 - Text: 
0.7293064575666538
Amazon Titan Text Embeddings V2: 
0.7805891835451588
```

### JSICK

```
Cohere Embed: 
0.7726640670055452
Amazon Titan Embeddings G1 - Text: 
0.7706490119268313
Amazon Titan Text Embeddings V2: 
0.7680714674118668
```