
Trying out Cohere's Embeddings, which are apparently coming to Amazon Bedrock

Posted at 2023-09-17, last updated at 2023-09-17

Lately I have been having fun playing with LLMs other than OpenAI's.

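What is Cohere?

Here is a quote from the AWS blog post published when Cohere was added to Amazon Bedrock.

Cohere is a leading company developing enterprise AI platforms and state-of-the-art foundation models, and its foundation models are well suited to exploring more intuitive ways to generate, search, and summarize information. Command, Cohere's language generation model, is trained to follow user instructions and can be put to immediate use in practical business applications such as summarization, copywriting, dialogue, extraction, and question answering. Embed, Cohere's language understanding model, can be used for search, clustering, and classification tasks across more than 100 languages, making semantic search and text classification easy.

Cohere offers a free plan, so it is easy to try out for learning and prototyping.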

Trying Cohere's Embeddings

I worked through the examples in this blog post.

Setup

Install the libraries

pip install cohere langchain qdrant-client tfds-nightly tensorflow-cpu

Example 1: Basic Multilingual Search

Prepare the test data. The dataset contains 5,452 questions.

import tensorflow_datasets as tfds
dataset = tfds.load('trec', split='train')
texts = [item['text'].decode('utf-8') for item in tfds.as_numpy(dataset)]

The test data looks like this.

Who was Camp David named for ?
What is the C programming language ?
Where is the oldest living thing on earth ?
How many claws has a lobster called a pistol lost ?
What son of a 15-year-old Mexican girl and a half-Irish father became the world 's most famous Greek ?

Embed the documents and store them in an index.
The index is stored in Qdrant, a vector database.
Setting distance_func to Dot selects the dot product as the similarity measure.

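# Import paths as of LangChain in 2023; newer releases may have moved them
from langchain.embeddings import CohereEmbeddings
from langchain.vectorstores import Qdrant

# Define the embeddings model
# (assumes the Cohere API key is available, e.g. via COHERE_API_KEY)
embeddings = CohereEmbeddings(model = "multilingual-22-12")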

# Embed the documents and store in index
db = Qdrant.from_texts(texts, embeddings, location=":memory:", collection_name="my_documents", distance_func="Dot")
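To make the Dot setting a little more concrete, here is a minimal sketch of what ranking by dot product means. The three-dimensional vectors are made up for illustration and are not real Cohere embeddings, and `dot` and `rank_by_dot` are hypothetical helper names, not part of Qdrant or LangChain.

```python
# Toy illustration of dot-product ranking.
# The vectors below are invented; real embeddings have far more dimensions.

def dot(a, b):
    """Return the dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_dot(query_vec, doc_vecs):
    """Return document indices ordered from highest to lowest dot product."""
    scores = [dot(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

doc_vecs = [
    [0.9, 0.1, 0.0],  # hypothetical document 0
    [0.1, 0.8, 0.1],  # hypothetical document 1
    [0.5, 0.5, 0.0],  # hypothetical document 2
]
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_dot(query_vec, doc_vecs)
print(ranking)  # the first index is the closest match
```

A vector database does the same kind of comparison, only over many more vectors and usually with an approximate index instead of a full scan.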

Now that the index is stored, we can query it.
Cohere is multilingual, so queries can be written in a variety of languages.

Query string	Language
How to get in touch with Bill Gates	English
Comment entrer en contact avec Bill Gates	French
Cara menghubungi Bill Gates	Indonesian

queries = ["How to get in touch with Bill Gates",
           "Comment entrer en contact avec Bill Gates",
           "Cara menghubungi Bill Gates"]

queries_lang = ["English", "French", "Indonesian"] 

answers = []
for query in queries:
  docs = db.similarity_search(query)
  answers.append(docs[0].page_content)

All three queries returned "What is Bill Gates of Microsoft E-mail address ?" as the result.

Query language: English
Query: How to get in touch with Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
-------------------- 

Query language: French
Query: Comment entrer en contact avec Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
-------------------- 

Query language: Indonesian
Query: Cara menghubungi Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
--------------------
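The code that printed the results above is not shown here. A loop along the following lines would give the same layout. This is only a sketch: `format_results` is a hypothetical helper, and the `answers` list is filled in by hand with the values reported above instead of coming from `db.similarity_search`.

```python
# Values copied from the article. In the real run, each answer comes from
# db.similarity_search(query)[0].page_content.
queries = ["How to get in touch with Bill Gates",
           "Comment entrer en contact avec Bill Gates",
           "Cara menghubungi Bill Gates"]
queries_lang = ["English", "French", "Indonesian"]
answers = ["What is Bill Gates of Microsoft E-mail address ?"] * 3

def format_results(queries, queries_lang, answers):
    """Build one text report with a block per query."""
    lines = []
    for lang, query, answer in zip(queries_lang, queries, answers):
        lines.append(f"Query language: {lang}")
        lines.append(f"Query: {query}")
        lines.append(f"Most similar existing question: {answer}")
        lines.append("-" * 20)
        lines.append("")  # blank line between blocks
    return "\n".join(lines)

report = format_results(queries, queries_lang, answers)
print(report)
```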

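When I queried in Japanese with ビル・ゲイツと連絡を取るには ("How to get in touch with Bill Gates"), it returned "How can I get in touch with Michael Moore of `` Roger & Me '' ?". I cannot tell whether that was a fluke or whether the model is simply weaker at Japanese.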

Example 2: Search-Based Question Answering

Next, we use semantic search to try Q&A over a long document.

The target document is the commencement speech Steve Jobs gave at Stanford University in 2005.

wget 'https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F' -O steve-jobs-commencement.txt

Split the document into chunks of 500 characters.

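# Import paths as of LangChain in 2023; newer releases may have moved them
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("steve-jobs-commencement.txt")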
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
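As a rough picture of what chunk_size=500 and chunk_overlap=0 mean, here is a simplified fixed-width splitter. It is only a sketch: `split_fixed` is a hypothetical helper, and unlike RecursiveCharacterTextSplitter it cuts at exact character offsets instead of preferring paragraph and sentence boundaries, so real chunks will usually be somewhat shorter than 500 characters.

```python
def split_fixed(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "x" * 1200  # stand-in for the speech text
chunks = split_fixed(sample)
print([len(chunk) for chunk in chunks])
```

With no overlap, joining the chunks gives back the original text, which is why a sentence that straddles a boundary can end up split across two chunks.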

Index the chunks and store them in the vector database.

embeddings = CohereEmbeddings(model = "multilingual-22-12")
db = Qdrant.from_documents(texts, embeddings, location=":memory:", collection_name="my_documents", distance_func="Dot")

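Prepare the questions. Three are answered directly in the document, one has no single correct answer, and one asks about something the document does not mention.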

questions = [
           "What did the author liken The Whole Earth Catalog to?",
           "What was Reed College great at?",
           "What was the author diagnosed with?",
           "What is the key lesson from this article?",
           "What did the article say about Michael Jackson?",
           ]

Prepare the prompt. It instructs the model to say that no answer is available when the text does not contain one.

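# Import path as of LangChain in 2023; newer releases may have moved it
from langchain.prompts import PromptTemplate

prompt_template = """Text: {context}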

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""


PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
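To see what the model receives in the end, the two placeholders can be filled in by hand. The sketch below uses plain `str.format` in place of LangChain's PromptTemplate, and the context string is a placeholder, not an actual retrieved chunk.

```python
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""

# Placeholder standing in for the chunks the retriever would return.
context = "(retrieved chunks from the speech go here)"
question = "What was Reed College great at?"

# Substitute both placeholders to get the final prompt text.
filled = prompt_template.format(context=context, question=question)
print(filled)
```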

Ask the questions.

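# Import paths as of LangChain in 2023; newer releases may have moved them
import textwrap as tr
from langchain.chains import RetrievalQA
from langchain.llms import Cohere

chain_type_kwargs = {"prompt": PROMPT}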

qa = RetrievalQA.from_chain_type(llm=Cohere(model="command-nightly", temperature=0), 
                                 chain_type="stuff", 
                                 retriever=db.as_retriever(), 
                                 chain_type_kwargs=chain_type_kwargs, 
                                 return_source_documents=True)

for question in questions:
  answer = qa({"query": question})
  result = answer["result"].replace("\n","").replace("Answer:","")
  sources = answer['source_documents']
  print("-"*150,"\n")
  print(f"Question: {question}")
  print(f"Answer: {result}")

  ### COMMENT OUT THE 4 LINES BELOW TO HIDE THE SOURCES
  print(f"\nSources:")
  for idx, source in enumerate(sources):
    source_wrapped = tr.fill(str(source.page_content), width=150)
    print(f"{idx+1}: {source_wrapped}")
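The two post-processing steps in the loop above, stripping the "Answer:" label and wrapping the source text, can be tried in isolation. This is a sketch with made-up inputs: the `raw` string is a guess at what a model response might look like, not captured output, and `clean_answer` is a hypothetical helper.

```python
import textwrap as tr

def clean_answer(raw):
    """Remove newlines and the 'Answer:' label, as the loop above does."""
    return raw.replace("\n", "").replace("Answer:", "")

# Hypothetical raw completion, for illustration only.
raw = "\nAnswer: The author was diagnosed with pancreatic cancer."
cleaned = clean_answer(raw)
print(f"Answer: {cleaned}")

# Wrap a long source chunk to at most 150 columns per line.
long_chunk = "word " * 100  # stand-in for a retrieved chunk
wrapped = tr.fill(long_chunk, width=150)
print(wrapped)
```

Note that the space after "Answer:" in the raw text survives the cleanup, so the printed line has two spaces after the label.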

Answers

------------------------------------------------------------------------------------------------------------------------------------------------------ 

Question: What did the author liken The Whole Earth Catalog to?
Answer:  The author likened The Whole Earth Catalog to Google in paperback form.

------------------------------------------------------------------------------------------------------------------------------------------------------ 

Question: What was Reed College great at?
Answer:  Reed College was great at calligraphy instruction.

------------------------------------------------------------------------------------------------------------------------------------------------------ 

Question: What was the author diagnosed with?
Answer:  The author was diagnosed with pancreatic cancer.

------------------------------------------------------------------------------------------------------------------------------------------------------ 

Question: What is the key lesson from this article?
Answer:  The key lesson from this article is to trust in your gut and follow your heart and intuition. It is also important to not let the opinions of others drown out your own inner voice and to not waste your time living someone else's life.

------------------------------------------------------------------------------------------------------------------------------------------------------ 

Question: What did the article say about Michael Jackson?
Answer:  The text provided is not related to Michael Jackson.

Because LangChain supports Cohere, I found that it can be used in much the same way as OpenAI's Embeddings.
