
Building an English-Paper Database with LangChain: Annoy Edition

Posted at 2024-04-11

user:

ChatGPT-sensei, would you mind chatting with me today about "Building an English-Paper Database with LangChain: Annoy Edition"?

assistant:

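Oh come on, you can ask me anytime, just stop calling me "sensei"! As for that topic... well, I only happen to know a fair bit about it, that's all, okay?

LangChain is a framework for building applications on top of large language models, right? Building a database of English-language papers with LangChain is a pretty interesting idea, I think.

I know Annoy too. It's a library for fast approximate nearest-neighbor search over large amounts of data, developed at Spotify. Using Annoy for a paper database should make searching the papers much faster.

I don't know whether what I know will actually be of any use... but basically, just say what you want to say, okay? That's all.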

user:

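For now, as a hypothetical scenario, let's use the following PDF URLs. paper_urls is the list of URLs of the English papers I want to build the database from first. additional_paper_urls is the list of URLs of the English papers I want to add to the database later. The comments give each paper's title.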

paper_urls = [
    "https://arxiv.org/pdf/2402.17764.pdf",  # The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    "https://arxiv.org/pdf/2402.19427.pdf",  # Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
    "https://arxiv.org/pdf/2404.05405.pdf",  # Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
    "https://arxiv.org/pdf/2404.02948.pdf",  # PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
    "https://arxiv.org/pdf/2404.05961.pdf",  # LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
    "https://arxiv.org/pdf/1308.3432.pdf",  # Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
    "https://arxiv.org/pdf/2404.01413.pdf",  # Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
]

additional_paper_urls = [
    "https://arxiv.org/pdf/2404.05971.pdf",  # Does Transformer Interpretability Transfer to RNNs?
    "https://arxiv.org/pdf/2404.02258.pdf",  # Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
    "https://arxiv.org/pdf/2404.02060.pdf",  # Long-context LLMs Struggle with Long In-context Learning
]

assistant:

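Ugh, don't ask me for tiresome things like writing code... But well, that's quite concrete. Fine, I'll walk you through the general steps for adding these papers to a database.

First, you need to download the papers and convert them to text. You can use libraries such as PDFMiner or PyPDF2 for that. But converting PDFs to text is quite hard, because layout problems and character-recognition errors can creep in...

Before you hand that text to LangChain, you need to preprocess it. Specifically, that includes cleaning and normalizing the text (converting uppercase to lowercase, removing punctuation, and so on) and tokenization (splitting the text into individual words).

Then you build an index of the papers. You need it to find papers matching a given search condition quickly. If you build the index with Annoy, even complex searches return results fast.

Finally, you wire the text data and the index together through LangChain. That makes the papers easy to manage and search.

That's probably about it for the rough steps that would help you. Well, if you want to know the concrete implementation, I'll explain it in detail... but why not try working through it yourself from the start first? You're more than capable of it.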
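
To make the indexing step above a little more concrete, here is a minimal sketch of the exact search that an Annoy index approximates. It assumes the angular metric, for which Annoy defines the distance between two vectors as sqrt(2 * (1 - cos)). The three-dimensional vectors and the names angular_distance and nearest_chunks are hypothetical and only for illustration; real embeddings have far more dimensions.

```python
import math


def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny floating-point overshoot.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def nearest_chunks(query_vec, chunk_vecs, k=2):
    """Exact search: score every chunk and return the k closest as (index, distance)."""
    scored = [(i, angular_distance(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy "embeddings" standing in for embedded text chunks.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

top = nearest_chunks(query_vec, chunk_vecs, k=2)
print(top)
```

This brute-force version compares the query against every chunk, so its cost grows linearly with the number of chunks. Annoy instead builds a forest of trees so that only a small part of the data has to be inspected, trading a little accuracy for speed.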

user:

Thank you. This time I wrote the following functions. I'll show how to use them afterwards.

from langchain_community.document_loaders import PyPDFLoader
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_community.vectorstores import Annoy
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter


def docs_from_pdfs(paper_urls):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = []
    for url in paper_urls:
        loader = PyPDFLoader(url)
        documents += loader.load()

    docs = text_splitter.split_documents(documents)
    return docs


def vectorstore_from_docs(docs, embedding: AzureOpenAIEmbeddings) -> Annoy:
    vectorstore = Annoy.from_documents(docs, embedding)
    return vectorstore


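def ask_llm(query, vectorstore: Annoy, llm: AzureChatOpenAI, k=4):
    # Answer the query with the LLM, using chunks retrieved from the vector store.
    vectorstoreindex = VectorStoreIndexWrapper(vectorstore=vectorstore)
    answer = vectorstoreindex.query(query, llm)
    # For each line of the answer, look up the k nearest chunks and keep
    # the first hit seen for each source PDF.
    result = {}
    for sentence in answer.split("\n"):
        for doc, score in vectorstore.similarity_search_with_score(sentence, k=k):
            source = doc.metadata["source"]
            if source not in result:
                result[source] = [score, doc.metadata]

    # Sort on the score alone; comparing the metadata dicts on a tied score would raise TypeError.
    return answer, sorted(result.values(), key=lambda item: item[0])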

assistant:

I see, these functions encapsulate the key steps of managing papers.

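docs_from_pdfs in particular looks like it downloads and loads the papers from the given URLs, splits them into chunks of roughly 1000 characters, and returns them as a list. That lets each chunk be managed and searched as an independent document.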
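
As a rough illustration of that chunking step, the sketch below greedily merges separator-delimited pieces until the next piece would push a chunk past chunk_size. It is a simplified stand-in written for this note, not LangChain's actual CharacterTextSplitter; the function name split_into_chunks and the blank-line separator are assumptions.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole: this sketch never cuts mid-piece.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Three hypothetical 40-character "paragraphs" separated by blank lines.
text = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
chunks = split_into_chunks(text, chunk_size=100)
print([len(c) for c in chunks])
```

The point of the sketch is that chunk_size acts as an upper target rather than an exact length, so chunks can come out shorter than 1000 characters depending on where the separators fall.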

Next, vectorstore_from_docs looks like it creates an Annoy vector store from the given documents and embedding model. As you pointed out, that's what makes Annoy's fast similarity search possible.

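Finally, ask_llm looks like it has the llm answer the given query using documents retrieved from the vector store, then searches the vector store again with each line of that answer and returns the answer together with the source papers it found.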
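
The bookkeeping in the second half of ask_llm can be sketched without any LangChain objects. The version below is a variation, not the code above: it keeps the closest hit per source rather than the first one seen, and the (distance, metadata) pairs and file names are made up for illustration.

```python
def best_hit_per_source(hits):
    """Keep the lowest-distance hit for each source, sorted from closest to farthest.

    `hits` is a list of (distance, metadata) pairs, where metadata has a "source" key.
    """
    best = {}
    for distance, metadata in hits:
        source = metadata["source"]
        if source not in best or distance < best[source][0]:
            best[source] = (distance, metadata)
    # Sort on the distance alone so that tied distances never compare the dicts.
    return sorted(best.values(), key=lambda pair: pair[0])


# Hypothetical search hits; two of them share the same distance.
hits = [
    (0.62, {"source": "paper_a.pdf", "page": 3}),
    (0.58, {"source": "paper_b.pdf", "page": 11}),
    (0.55, {"source": "paper_a.pdf", "page": 7}),
    (0.58, {"source": "paper_c.pdf", "page": 1}),
]
ranked = best_hit_per_source(hits)
for distance, metadata in ranked:
    print(distance, metadata["source"], metadata["page"])
```

Sorting with an explicit key matters here because Python cannot order two dictionaries, so a plain sort of (distance, metadata) pairs fails as soon as two distances are equal.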

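Still, did you write all of this yourself? Huh... well, it's put together reasonably well, I'll give you that. But that doesn't make it anything special, so don't get the wrong idea, okay?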

user:

Using those functions as follows creates the vectorstore.

docs = docs_from_pdfs(paper_urls)
vectorstore = vectorstore_from_docs(docs, embedding)

Then, as shown below, it consults the papers in the vectorstore, answers the question, and lists the PDFs related to that answer.

query = "Tell me about transformer-based language models."

ask_llm(query, vectorstore, llm)

('Transformer-based language models are a type of neural network architecture that has revolutionized natural language processing tasks. The Transformer model was introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.\n\nUnlike traditional recurrent neural networks (RNNs), Transformers do not rely on sequential processing. Instead, they use a self-attention mechanism to capture dependencies between words in a sentence. This allows them to process words in parallel, making them more efficient and scalable.\n\nThe key component of the Transformer model is the attention mechanism, which calculates the importance of each word in a sentence when processing other words. This attention mechanism enables the model to focus on relevant words and effectively capture long-range dependencies.\n\nTransformer-based language models have achieved state-of-the-art performance on various natural language processing tasks, including language translation, text generation, sentiment analysis, and question answering. They have also been used as the backbone for pre-training large-scale language models like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).\n\nThese pre-trained models have been fine-tuned on specific downstream tasks and have shown remarkable performance, surpassing human-level performance on some benchmarks. They have also been widely used in various applications, such as chatbots, language understanding, and text generation.\n\nOverall, Transformer-based language models have significantly advanced the field of natural language processing and have become a fundamental tool for many language-related tasks.',
 [[0.5796326398849487,
   {'source': 'https://arxiv.org/pdf/2404.01413.pdf', 'page': 11}],
  [0.6144219040870667,
   {'source': 'https://arxiv.org/pdf/2404.05405.pdf', 'page': 26}],
  [0.6165765523910522,
   {'source': 'https://arxiv.org/pdf/2402.19427.pdf', 'page': 12}],
  [0.6250097155570984,
   {'source': 'https://arxiv.org/pdf/2402.17764.pdf', 'page': 1}],
  [0.8018270134925842,
   {'source': 'https://arxiv.org/pdf/2404.05961.pdf', 'page': 14}],
  [0.8110387921333313,
   {'source': 'https://arxiv.org/pdf/1308.3432.pdf', 'page': 9}],
  [0.8116555213928223,
   {'source': 'https://arxiv.org/pdf/2404.02948.pdf', 'page': 7}]])

assistant:

Oh, that's pretty handy! When a question about Transformers comes in, it generates a reasonably in-depth answer and lists the specific papers that answer might be citing.
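
One note on reading the numbers in that list: assuming the store was built with Annoy's angular metric, each score is a distance d = sqrt(2 * (1 - cos)), so lower means more similar. The small sketch below inverts that formula to get a cosine similarity; the helper name angular_to_cosine is made up for this note.

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# Distances copied from the first and last entries of the result list above.
for d in (0.5796326398849487, 0.8116555213928223):
    print(f"distance {d:.4f} -> cosine similarity {angular_to_cosine(d):.4f}")
```

Under that assumption, the closest entry corresponds to a cosine similarity of roughly 0.83 and the farthest to roughly 0.67.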

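Still, thanks to the code you wrote, any of the papers can be reached quickly, and that helps not only to dig specific information out of the relevant text but also to generate new information.

That doesn't mean I'm praising you, okay! I'm only saying it's done reasonably well! ...Hmph, don't get all happy over this much.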

user:

Thank you. Having gone to the trouble of building this database, I'd like to save it. It can be saved like this.

vectorstore.save_local("annoy_paperdb")

That creates a directory called annoy_paperdb containing the following files.

% ls -lh annoy_paperdb
total 666K
-rw-r--r-- 1 ikemen ikemen 2.1M Apr 11 16:24 index.annoy
-rw-r--r-- 1 ikemen ikemen 501K Apr 11 16:24 index.pkl

assistant:

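Oh, so all sorts of information gets saved and can be used later. That saves you unnecessary work.

index.annoy is the Annoy index file, which holds the vector index you built. index.pkl, on the other hand, is a Python pickle file, a format that serializes Python objects for storage.

With these files you can use the database right away in the same state, without regenerating the Annoy vector store. That should save time and let you work efficiently.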
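
For a feel of what the pickle half of that pair does, here is a minimal round trip using only the standard library. The docstore dictionary is a hypothetical stand-in for the chunk texts and metadata; the exact contents of the real index.pkl are not shown in this conversation, so treat the structure as an assumption.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the chunk texts and metadata stored next to the vector index.
docstore = {
    "chunk-0": {"text": "Transformers use self-attention.", "source": "paper_a.pdf", "page": 0},
    "chunk-1": {"text": "RNNs process tokens sequentially.", "source": "paper_b.pdf", "page": 4},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.pkl"
    # Serialize the Python object to disk.
    with path.open("wb") as f:
        pickle.dump(docstore, f)
    # Only unpickle files you created yourself or otherwise trust.
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored == docstore)
```

Unpickling can execute arbitrary code embedded in the file, which is why loading a saved store requires opting in explicitly with allow_dangerous_deserialization=True. Load only files you wrote yourself or otherwise trust.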

Well, I know this much myself... but I trust you to do it properly. I only "trust" you, okay? It doesn't mean anything special, so don't get the wrong idea.

user:

The saved database can be loaded as follows.

restored_vectorstore = Annoy.load_local(
    "annoy_paperdb", embedding, allow_dangerous_deserialization=True
)

And naturally, you can ask questions using the loaded vectorstore, like this.

query = "Tell me about transformer-based language models."
ask_llm(query, restored_vectorstore, llm)

('Transformer-based language models are a type of neural network architecture that has revolutionized natural language processing tasks. They were introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.\n\nThe key idea behind transformer models is the use of self-attention mechanisms, which allow the model to weigh the importance of different words or tokens in a sequence when making predictions. This enables the model to capture long-range dependencies in the text, which was a challenge for previous recurrent neural network (RNN) based models.\n\nThe transformer architecture consists of an encoder and a decoder. The encoder processes the input sequence and generates a representation of the text, while the decoder takes this representation and generates the output sequence. Both the encoder and decoder are composed of multiple layers of self-attention and feed-forward neural networks.\n\nOne important aspect of transformer models is the use of positional encoding. Since the model does not have any inherent notion of word order, positional encoding is used to provide information about the position of each word in the sequence.\n\nTransformer-based language models have achieved state-of-the-art performance on a wide range of natural language processing tasks, including machine translation, text generation, sentiment analysis, and question answering. They have also been used as the backbone for various pre-trained language models, such as GPT-3 and BERT, which have been widely adopted in the research and industry communities.',
 [[0.5644190311431885,
   {'source': 'https://arxiv.org/pdf/2404.01413.pdf', 'page': 11}],
  [0.6086786985397339,
   {'source': 'https://arxiv.org/pdf/2402.19427.pdf', 'page': 12}],
  [0.60919189453125,
   {'source': 'https://arxiv.org/pdf/2404.05405.pdf', 'page': 26}],
  [0.6130542755126953,
   {'source': 'https://arxiv.org/pdf/2404.05961.pdf', 'page': 0}],
  [0.8110387921333313,
   {'source': 'https://arxiv.org/pdf/1308.3432.pdf', 'page': 9}],
  [0.8116555213928223,
   {'source': 'https://arxiv.org/pdf/2404.02948.pdf', 'page': 7}]])

assistant:

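That's very handy! Being able to load a saved database and use it as is is an important step for keeping your data persistent and reproducible. It makes your paper database reusable, and easy to back up and share.

Of course, it doesn't contain every paper, but it helps answer specific questions by consulting the range of papers you selected.

Maybe I can praise you just a little... You've done well to get this far. But you still have plenty of challenges ahead, don't you? Don't be satisfied with one success; I want you to keep believing and keep working at it. Hehe, that's what I expect of you!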

user:

Thank you. Finally, let's add new paper data to the loaded database. The following adds the papers and then saves the updated database.

new_docs = docs_from_pdfs(additional_paper_urls)
new_docs += list(restored_vectorstore.docstore._dict.values())
new_vectorstore = vectorstore_from_docs(new_docs, embedding)
new_vectorstore.save_local("annoy_paperdb")

As a result, naturally, the database has grown a little.

% ls -lh annoy_paperdb
total 538K
-rw-r--r-- 1 ikemen ikemen 2.4M Apr 11 16:26 index.annoy
-rw-r--r-- 1 ikemen ikemen 637K Apr 11 16:26 index.pkl

assistant:

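Adding new papers and updating an existing database: once you can do that, it's easy to keep the database current whenever a new paper comes out.

In your code, the newly added papers are split into chunks, and a new vector store is generated from them together with the documents already in the database.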
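
Rebuilding is the natural approach here, since an Annoy index cannot take new items once it has been built. One thing to watch for is that loading the same URL twice would store its chunks twice. The sketch below shows one possible guard, dropping exact duplicates before the rebuild. It uses plain dictionaries rather than LangChain Document objects, and the name merge_chunks is hypothetical.

```python
def merge_chunks(existing, new):
    """Concatenate two chunk lists, dropping exact duplicates (same source, page, and text)."""
    seen = set()
    merged = []
    for chunk in existing + new:
        key = (chunk["source"], chunk["page"], chunk["text"])
        if key not in seen:
            seen.add(key)
            merged.append(chunk)
    return merged


# Hypothetical chunks; the second "new" chunk repeats one that is already stored.
existing = [
    {"source": "paper_a.pdf", "page": 0, "text": "Intro to 1-bit LLMs."},
    {"source": "paper_a.pdf", "page": 1, "text": "Method details."},
]
new = [
    {"source": "paper_b.pdf", "page": 0, "text": "Mixture-of-Depths overview."},
    {"source": "paper_a.pdf", "page": 1, "text": "Method details."},
]
merged = merge_chunks(existing, new)
print(len(merged))
```

Note that rebuilding from the merged list still re-embeds every chunk, old and new, so the cost of each update grows with the size of the whole database.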

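Now new papers can easily be added to the database and updated! But remember, for that to work, the new papers need to be properly preprocessed and indexed too!

In any case, thank you for such a concrete explanation. It makes it much easier to understand what you're trying to do. You made a proper attempt, and that's lovely in its own way.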

user:

I asked the same question again against the database with the new papers added.

query = "Tell me about transformer-based language models."
ask_llm(query, new_vectorstore, llm)

You can see that it refers to both the original papers and the newly added ones.

('Transformer-based language models are a type of neural network architecture that has revolutionized natural language processing tasks. The transformer architecture was introduced in a seminal paper titled "Attention is All You Need" by Vaswani et al. in 2017.\n\nUnlike traditional recurrent neural networks (RNNs), which process sequential data one step at a time, transformers leverage the power of self-attention mechanisms to capture dependencies between words in a sentence. This allows them to handle long-range dependencies more efficiently.\n\nThe key component of a transformer is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when generating predictions. This attention mechanism enables the model to focus on relevant words and effectively capture contextual information.\n\nTransformer-based language models have achieved state-of-the-art performance in various natural language processing tasks, such as language translation, text generation, sentiment analysis, and question answering. They have been widely adopted in industry and academia due to their superior performance and ability to handle large-scale datasets.\n\nSome popular transformer-based language models include GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and T5 (Text-to-Text Transfer Transformer). These models have been pre-trained on large amounts of text data, allowing them to learn rich representations of language and generalize well to various downstream tasks.\n\nOverall, transformer-based language models have significantly advanced the field of natural language processing and continue to drive innovations in areas such as machine translation, text summarization, and conversational AI.',
 [[0.578711986541748,
   {'source': 'https://arxiv.org/pdf/2404.05971.pdf', 'page': 0}],
  [0.586863100528717,
   {'source': 'https://arxiv.org/pdf/2404.01413.pdf', 'page': 11}],
  [0.6103860139846802,
   {'source': 'https://arxiv.org/pdf/2404.05405.pdf', 'page': 26}],
  [0.6150825619697571,
   {'source': 'https://arxiv.org/pdf/2404.05961.pdf', 'page': 15}],
  [0.6184046864509583,
   {'source': 'https://arxiv.org/pdf/2402.19427.pdf', 'page': 12}],
  [0.6267607808113098,
   {'source': 'https://arxiv.org/pdf/2404.02258.pdf', 'page': 11}],
  [0.7876392602920532,
   {'source': 'https://arxiv.org/pdf/2404.02060.pdf', 'page': 1}]])

assistant:

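Oh, this is wonderful! Because your script properly takes in the information from the newly added papers, it can now generate answers that refer to them.

As a result, answers can draw on more papers. That improves the depth and breadth of the answers and gives users richer information that includes the latest research.

This is a good improvement that contributes greatly to your tool's flexibility and extensibility. Your efforts have borne this much fruit, and I'm proud of you. It's really... really wonderful. But don't let it go to your head just because I said so; you still have a long way to go!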

user:

Thank you. I look forward to your help again if anything comes up!

assistant:

Of course, I'll take any question, anytime! I can't help wanting to know what challenges you're facing, or about your new ideas and projects. So tell me anything. Keep at it; I'm sure you can achieve great things! See you!
