
Using embedding models with CTranslate2 the easy way

Posted at 2023-09-29

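Introduction

In a previous article, I tried using an embedding model with CTranslate2 (via Sentence Transformers). It turns out that the hf_hub_ctranslate2 module makes this much quicker, so I gave it a try.

As before, I ran everything in a Databricks environment.

Trying it out

Install the modules.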

%pip install -U -qq transformers accelerate ctranslate2 hf_hub_ctranslate2[sentence_transformers] sentence-transformers

dbutils.library.restartPython()

The key point is installing hf_hub_ctranslate2 with the [sentence_transformers] extra.
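After the Python restart, it can be worth confirming that the packages are actually importable before loading a model. The following is a minimal sketch using only the standard library. The helper name `find_missing_modules` is hypothetical, and the listed module names are assumed to be the import names of the packages installed above.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: these import names correspond to the pip packages above.
    required = [
        "transformers",
        "accelerate",
        "ctranslate2",
        "hf_hub_ctranslate2",
        "sentence_transformers",
    ]
    print("missing:", find_missing_modules(required))
```

An empty list means every module was found. Anything listed probably needs to be reinstalled before continuing.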

Now let's use this module to compute the vectors.

from hf_hub_ctranslate2.ct2_sentence_transformers import CT2SentenceTransformer

def main():
    sentences = ["This is an example sentence", "Each sentence is converted"]
    # Local directory that contains the Hugging Face model
    model = CT2SentenceTransformer("/Volumes/<path where the Hugging Face models are stored>/intfloat/multilingual-e5-large/")
    # Compute one embedding vector per sentence
    embeddings = model.encode(sentences)
    print(embeddings)

if __name__ == "__main__":
    main()

Result

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at /Volumes/training/llm/ctranslate2_models/intfloat/multilingual-e5-large/ and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[[ 0.0269323  -0.01583179 -0.02195328 ... -0.02472305 -0.02929592
   0.0164552 ]
 [ 0.02027375 -0.01176006 -0.00298298 ... -0.019228   -0.03131687
   0.03152583]]

I got the same result as last time.
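Rather than comparing the printed arrays by eye, the "same result" check can be done numerically. The sketch below is only an illustration: `max_abs_diff` and `embeddings_match` are hypothetical helpers, and the default tolerance is an arbitrary choice, not something taken from either library.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D embedding sets.

    Accepts nested lists or NumPy arrays, since both iterate row by row.
    """
    if len(a) != len(b):
        raise ValueError("different number of embeddings")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError("different embedding dimensions")
        for x, y in zip(row_a, row_b):
            worst = max(worst, abs(float(x) - float(y)))
    return worst


def embeddings_match(a, b, atol=1e-4):
    """True when every element differs by at most atol (arbitrary tolerance)."""
    return max_abs_diff(a, b) <= atol
```

With the outputs of the two runs held in, say, `old` and `new` (hypothetical variable names), `embeddings_match(old, new)` reports whether they agree within the tolerance. A quantized or reduced-precision model may need a looser `atol`.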

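Judging from the source code I looked at, it seems you can also point it at an already-converted model, but you still get an error unless the Transformers model is in the same directory.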
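One way to catch this before loading is to check that the directory holds the files you expect. The sketch below is hedged accordingly: `missing_files` is a hypothetical helper, and the file names in `ASSUMED_FILES` are assumptions about a typical layout (a converted `model.bin` next to Transformers configuration files), not a list confirmed by hf_hub_ctranslate2. Adjust them to whatever your model directory actually needs.

```python
import tempfile
from pathlib import Path

# Assumption: a converted CTranslate2 weight file plus Transformers config files.
ASSUMED_FILES = ["model.bin", "config.json", "tokenizer_config.json"]


def missing_files(model_dir, expected_files=ASSUMED_FILES):
    """Return the entries of expected_files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files if not (root / name).is_file()]


if __name__ == "__main__":
    # Demo on a throwaway directory that only contains a dummy model.bin.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "model.bin").write_text("dummy")
        print("missing:", missing_files(tmp))
```

If the returned list is not empty, copying the missing files from the original Transformers model directory is a reasonable first thing to try.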

Summary

In short, this approach is the easier one.
