How results differ depending on the embedding method and the cosine similarity method

Last updated at 2023-11-17, posted at 2023-11-15

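Overview

Of the several ways to vectorize text (embedding), I tried three and confirmed that they produce different vectors.
Of the several ways to compute cosine similarity, I also tried three. These give almost identical results, differing only at the level of rounding error.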

Environment

❯ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

Python

python = ">=3.10,<3.13"
tiktoken = "^0.4.0"
openai = "^0.27.9"
langchain = "^0.0.272"
pandas = "^2.1.0"
azure-search-documents = { version = "11.4.0a20230509004", source = "azure-sdk-for-python" }
spacy = ">=3.2.0,<3.7.0"
tqdm = "^4.66.1"
ginza = "^5.1.3"
ja-ginza = "^5.1.3"
scipy = "^1.11.3"
scikit-learn = "^1.3.2"

Embedding methods

To compare the embeddings and their cosine similarity, I prepared the following two texts.

text1 = "Azure OpenAI は Microsoft と OpenAI のパートナーシップによって提供されるサービスです。"
text2 = "Azure OpenAI は Microsoft と OpenAI の共同開発による人工知能のプラットフォームです。"

I vectorized these two texts with the following three methods.

spacy
text-embedding-ada-002
text-similarity-ada-001

The spacy case

    import spacy

    nlp = spacy.load('ja_ginza')
    # nlp() returns a Doc object; the embedding itself is available as .vector
    vector1 = nlp(text1)
    vector2 = nlp(text2)

The text-embedding-ada-002 case

The vectors are obtained through LangChain.

    import os
    from typing import List

    from langchain.embeddings.openai import OpenAIEmbeddings


    class VectorConvertor:  # class name is assumed, taken from the error message
        def get_vector_index(self, content: str) -> List[float]:
            """Embed `content` and return the vector as a list of floats."""
            api_type = os.getenv("OPENAI_API_TYPE")
            api_base = os.getenv("OPENAI_API_BASE")
            # The model name is a literal string, not an environment variable.
            embedding_model_name = "text-embedding-ada-002"
            embedding_deploy_name = os.getenv("OPENAI_EMBEDDING_DEPLOYMENT_NAME")
            try:
                embeddings = OpenAIEmbeddings(
                    deployment=embedding_deploy_name,
                    model=embedding_model_name,
                    openai_api_base=api_base,
                    openai_api_type=api_type,
                )
                return embeddings.embed_query(content)
            except Exception as e:
                raise IOError(f"Something went wrong in vectorConvertor: {e}") from e


    vc = VectorConvertor()
    vector1 = vc.get_vector_index(text1)
    vector2 = vc.get_vector_index(text2)

The text-similarity-ada-001 case

Only embedding_model_name changes from the code above:

    import os
    from typing import List

    from langchain.embeddings.openai import OpenAIEmbeddings


    class VectorConvertor:  # class name is assumed, taken from the error message
        def get_vector_index(self, content: str) -> List[float]:
            """Embed `content` and return the vector as a list of floats."""
            api_type = os.getenv("OPENAI_API_TYPE")
            api_base = os.getenv("OPENAI_API_BASE")
            # The model name is a literal string, not an environment variable.
            embedding_model_name = "text-similarity-ada-001"
            embedding_deploy_name = os.getenv("OPENAI_EMBEDDING_DEPLOYMENT_NAME")
            try:
                embeddings = OpenAIEmbeddings(
                    deployment=embedding_deploy_name,
                    model=embedding_model_name,
                    openai_api_base=api_base,
                    openai_api_type=api_type,
                )
                return embeddings.embed_query(content)
            except Exception as e:
                raise IOError(f"Something went wrong in vectorConvertor: {e}") from e


    vc = VectorConvertor()
    vector1 = vc.get_vector_index(text1)
    vector2 = vc.get_vector_index(text2)

Each of these three methods produces a different vector, but the vectors are rather long, so I omit their contents here.

Ways to compute cosine similarity

I tried the following three methods.

spacy
scikit-learn
scipy

The spacy case

This continues from the embedding step; the flow is as follows.

import spacy

nlp = spacy.load('ja_ginza')
vector1 = nlp(text1)
vector2 = nlp(text2)
result = vector1.similarity(vector2)

# result: 0.8977365414872635

spacy seems to compute cosine similarity by comparing the objects it creates itself, so unlike the other two approaches it uses its own mechanism.

The value obtained this way is 0.8977365414872635.
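
As a rough illustration of what this comparison amounts to: spaCy's documentation describes `Doc.vector` as defaulting to the average of the token vectors, and `Doc.similarity` as a cosine similarity over those vectors. The sketch below reproduces that idea in plain Python. The token vectors are made-up toy values, not output from ja_ginza, and the helper names are hypothetical.

```python
import math


def mean_vector(token_vectors):
    """Average a list of equal-length token vectors component-wise."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token vectors (assumed values, 3 dimensions for readability).
doc1_tokens = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc2_tokens = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

doc1_vector = mean_vector(doc1_tokens)  # [0.5, 0.5, 1.0]
doc2_vector = mean_vector(doc2_tokens)  # [0.5, 1.0, 0.5]
similarity = cosine_similarity(doc1_vector, doc2_vector)
print(similarity)
```

Because the document vector is an average, word order plays no role in this kind of similarity.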

The scikit-learn case

Comparing vectors obtained with text-similarity-ada-001

Cosine similarity can be obtained as follows.

from sklearn.metrics.pairwise import cosine_similarity

vector1 = [vc.get_vector_index(text1)]
vector2 = [vc.get_vector_index(text2)]
result = cosine_similarity(vector1, vector2)

# result: 0.9215424046489072

cosine_similarity appears to require two-dimensional arrays for both vector1 and vector2, so the inputs are wrapped in a list accordingly.

The cosine similarity obtained this way is 0.9215424046489072.
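
To make the two-dimensional requirement concrete: `cosine_similarity` treats each input as a matrix of shape (n_samples, n_features) and returns a matrix of pairwise similarities, so a single pair yields a 1x1 array rather than a bare float. The sketch below uses short made-up vectors in place of real embeddings.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for two embedding vectors (not real model output).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.5]

# Wrap each vector in a list so the input has shape (1, n_features).
matrix = cosine_similarity([vec_a], [vec_b])
print(matrix.shape)  # (1, 1)

# Pull the single scalar out of the 1x1 result.
score = float(matrix[0, 0])

# reshape(1, -1) is an equivalent way to get the 2D shape.
score_reshaped = float(
    cosine_similarity(
        np.asarray(vec_a).reshape(1, -1),
        np.asarray(vec_b).reshape(1, -1),
    )[0, 0]
)
print(score, score_reshaped)
```

The same call scales to many vectors at once: passing n rows and m rows returns an n x m matrix of similarities.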

For this usage of cosine_similarity() I referred to the following article.

Fast cosine similarity computation (cos類似度計算を高速に行う) #Python - Qiita
https://qiita.com/norikatamari/items/13b3bb8b67f2e7f9400f

Comparing vectors obtained with text-embedding-ada-002

Since the vectors themselves differ, a different result is only to be expected. It came out as follows.

from sklearn.metrics.pairwise import cosine_similarity

vector1 = [vc.get_vector_index(text1)]
vector2 = [vc.get_vector_index(text2)]
result = cosine_similarity(vector1, vector2)

# result: 0.964142529515918

The scipy case

Comparing vectors obtained with text-similarity-ada-001

from scipy.spatial.distance import cosine

vector1 = vc.get_vector_index(text1)
vector2 = vc.get_vector_index(text2)
result = 1 - cosine(vector1, vector2)

# result: 0.9215424046489066

This is very close to the value from scikit-learn's cosine_similarity, but differs in the last few digits.
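
One plausible source of such last-digit differences is that two algebraically equivalent formulas round differently in floating point. The sketch below is an illustration in plain Python, not the actual scikit-learn or SciPy implementation: it computes the cosine similarity once by dividing the dot product by the product of the norms, and once by normalizing each vector first. The vectors are made-up values and the function names are hypothetical.

```python
import math


def cosine_divide_last(u, v):
    """dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_normalize_first(u, v):
    """Scale u and v to unit length, then take the dot product."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum((a / norm_u) * (b / norm_v) for a, b in zip(u, v))


u = [0.12, -0.53, 0.88, 0.07, -0.31]
v = [0.10, -0.49, 0.91, 0.02, -0.35]

s1 = cosine_divide_last(u, v)
s2 = cosine_normalize_first(u, v)

# The two results agree to within rounding error,
# but are not guaranteed to be bit-identical.
print(s1, s2, abs(s1 - s2))
print(math.isclose(s1, s2, rel_tol=1e-12))
```

For this reason, comparing similarity scores with a tolerance (for example `math.isclose`) is safer than testing for exact equality.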

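For how to compute the value here, I referred to the following two pages.

I don't understand why 1 is subtracted when computing cosine distance (コサイン距離を求めるのに、１を引いている理由がわからない)
https://teratail.com/questions/80140

scipy.spatial.distance.cosine - SciPy v1.11.3 Manual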
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html

Comparing vectors obtained with text-embedding-ada-002

from scipy.spatial.distance import cosine

vector1 = vc.get_vector_index(text1)
vector2 = vc.get_vector_index(text2)
result = 1 - cosine(vector1, vector2)

# result: 0.9641425295159185

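My impression is that the printed value simply has one more digit than the one from scikit-learn's cosine_similarity; the two agree to within floating-point rounding.

Computing cosine similarity with scikit-learn and scipy on vectors obtained from spacy

As an aside, I also took the vectors produced by spacy and computed their cosine similarity with scikit-learn and scipy.

I used the following code.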

import spacy
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity


nlp = spacy.load('ja_ginza')
vector1 = nlp(text1).vector.tolist()
vector2 = nlp(text2).vector.tolist()
result1 = 1 - cosine(vector1, vector2)
result2 = cosine_similarity([vector1], [vector2])

# result1: 0.8977365705687558
# result2: 0.89773657

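As expected, the values are close.

The cosine similarity computed inside spacy was 0.8977365414872635, so the difference (about 3e-8) is at the level of rounding error, which indicates that essentially the same calculation is being performed.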
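
A gap of roughly 3e-8 is about what single-precision arithmetic would produce. spaCy stores vectors as 32-bit floats, whereas `.tolist()` yields Python floats that SciPy and scikit-learn then process in double precision. That this explains the gap here is an assumption, not something verified against the spaCy source. The sketch below uses random made-up vectors (300 dimensions is an arbitrary choice) to show the size of the effect.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity computed in the dtype of the inputs."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


rng = np.random.default_rng(0)

# Random stand-ins for two document vectors, stored as float32.
u32 = rng.standard_normal(300).astype(np.float32)
v32 = (u32 + 0.5 * rng.standard_normal(300)).astype(np.float32)

# Same values, once in single precision and once in double precision.
sim32 = float(cosine(u32, v32))
sim64 = float(cosine(u32.astype(np.float64), v32.astype(np.float64)))

print(sim32, sim64, abs(sim32 - sim64))
```

The inputs are identical in both calls; only the precision of the arithmetic differs, so any gap printed here comes purely from rounding.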

Summary

spacy
text-embedding-ada-002
text-similarity-ada-001

each produce different vectors.

spacy
scikit-learn
scipy

give almost the same cosine similarity.

In hindsight the results feel obvious, but choosing which method to use for obtaining vectors remains a tricky question.
