
Keyword Extraction with LangChain Tagging

Posted at 2024-06-09

Introduction

I tried extracting keywords with LangChain's tagging feature.
My environment was as follows.

Windows 11
Processor: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 2.42 GHz
Installed RAM: 8.00 GB (7.71 GB usable)
GPU: Intel(R) Iris(R) Xe Graphics

ollama

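For this experiment I installed ollama and used phi3-mini as the LLM.
Installing ollama is straightforward; for details, see articles such as
「Windows版 Ollama と Ollama-ui を使ってPhi3-mini を試してみた」 ("Trying Phi3-mini with Ollama for Windows and Ollama-ui").
As far as I can tell at the time of writing, ollama cannot reference LLM files stored on an external disk, so llama.cpp, although more troublesome to install at first, may actually turn out to be easier.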

Tagging

Tagging means attaching labels to a document.
The code below is taken almost verbatim from the Tagging snippet.

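Code

The code is shown below.
In the Classification class, each field's description serves as the prompt to the LLM, which then extracts the keywords.
The extracted keywords change with the prompt, so as an experiment this code extracts four variants: Keyword and Keyword2 through Keyword4.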

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, root_validator
from langchain_experimental.llms.ollama_functions import OllamaFunctions
from typing import List, Optional, Dict, Any

tagging_prompt = ChatPromptTemplate.from_template(
    """
Extract the desired information from the following passage.

Only extract the properties mentioned in the 'Classification' function.

Passage:
{input}
"""
)

class Classification(BaseModel):
    Keyword: List[str] = Field(
         description ="three(3) Japanese words (other than a pronoun) used to identify any of a class of people, places, or things (common noun), or Noun forms of verbs and so on , or adjective, or to name a particular one of these (proper noun) from {input}"
    )
    Keyword2: List[str] = Field(
         description ="Up to 3 Japanese keywords that you consider important"
    )
    Keyword3: List[str] = Field(
         ...,
         alias="日本語",  # For some reason the Japanese description is not recognized without this. Possibly bad practice.
         description = r"あなたが重要と思うデータベースを検索するためのキーワードを３つ以内でlist形式で提示してください"
    )
    Keyword4: List[str] = Field(
         ...,
         description ="Please provide up to 3 the hypernym and hyponym words for the keywords that you consider important"
    )
    adjective: List[str] = Field(
        ...,
        description ="Up to 3 words naming an attribute of a noun, such as sweet, red, or technical."
    )
    person: List[str] = Field(None, description="People or person names, e.g. 吉田茂")
    sentiment: Optional[str] = Field(description="The sentiment of the text")
    aggressiveness: Optional[int] = Field(
        description="How aggressive the text is on a scale from 1 to 10"
    )
    language: str = Field(
        "unknown",
        description="The language the text is written in",
        enum=["japanese", "spanish", "english", "french", "german", "italian"]
    )

    # Added because person raised an error; this may be bad practice.
    @root_validator(skip_on_failure=True)
    def _remove_person_if_none(cls, values: Dict[str, Any]) -> Dict[str, Any]:
        if "person" in values and values["person"] is None:
            del values["person"]
        return values

llm = OllamaFunctions(model="phi3", temperature=0).with_structured_output(
    Classification
)

tagging_chain = tagging_prompt | llm
inp = r"""小野亘. “OPAC等レガシーな検索システムに対する大規模言語モデル技術の適用可能性について.” Jst.go.jp, 2020, jxiv.jst.go.jp/index.php/jxiv/preprint/view/679, https://doi.org/10.51094/jxiv.679. Accessed 9 June 2024.
‌本稿は、GPTのような大規模言語モデル（LLM）の技術の進展に伴い、
図書館の蔵書検索（OPAC）のようなレガシーな検索システムに対して、GPTのような大規模言語モデルの技術が、
検索質問の生成、検索式への変換、意味を考慮した検索、結果の表示と適合性の評価という検索課程のそれぞれに対して、
適用できることを示した。また、OPAC自体がLLMに対しての情報基盤となり得ることを検討した。elvin schrödinger
"""
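# Run the chain on the passage and print the structured result.
res = tagging_chain.invoke({"input": inp})
print(res.dict())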

Results

{'Keyword': ['小野亘', 'OPAC', '検索システム', '大規模言語モデル技術', 'GPT', 'LLM', '検索質問の生成', '検索式への変換', '意味を考慮した検索', '結果の表示', '適用できること', 'OPAC自体が情報基盤'], 'Keyword2': ['大規模言語モデル技術', 'GPT', 'LLM'], 'Keyword3': ['日本'], 'Keyword4': ['検索システム', 'OPAC', '大規模言語モデル技術'], 'adjective': [], 'sentiment': None, 'aggressiveness': None, 'language': 'unknown'}

For reference, when I fed in part of a synopsis of Mori Ōgai's Maihime (The Dancing Girl), I got the following result.

{'Keyword': ['時', '19世紀末', '太田豊太郎', 'ドイツ', '留学', '日本', '船', '帰国', 'サイゴン', '無駄', '苦悩', '記す'], 'Keyword2': ['19世紀末', '太田豊太郎', 'サイゴン'], 'Keyword3': ['日本'], 'Keyword4': ['時間', '留学', '苦悩'], 'adjective': [], 'sentiment': None, 'aggressiveness': None, 'language': 'unknown'}
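In both runs, Keyword came back with twelve items even though its description asks for three, so the model does not reliably honor the requested count. Below is a minimal sketch, assuming you want to enforce the limit after the fact. The helper name cap_keywords and its limit parameter are introduced here for illustration; they are not part of LangChain.

```python
from typing import Any, Dict


def cap_keywords(result: Dict[str, Any], limit: int = 3) -> Dict[str, Any]:
    """Return a copy of result in which every list value is de-duplicated
    (keeping first occurrences) and truncated to at most `limit` items."""
    capped: Dict[str, Any] = {}
    for key, value in result.items():
        if isinstance(value, list):
            # dict.fromkeys preserves insertion order while dropping repeats.
            unique = list(dict.fromkeys(value))
            capped[key] = unique[:limit]
        else:
            # Non-list fields (sentiment, language, ...) pass through unchanged.
            capped[key] = value
    return capped


# A shortened version of the first result shown above.
sample = {
    "Keyword": ["小野亘", "OPAC", "検索システム", "大規模言語モデル技術", "GPT"],
    "Keyword2": ["大規模言語モデル技術", "GPT", "LLM"],
    "sentiment": None,
    "language": "unknown",
}
print(cap_keywords(sample))
```

With the chain from this article, the call would be cap_keywords(res.dict()). Truncation simply keeps the first items the model returned, which may not be the most important ones.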

Drawbacks

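Probably because I am running Phi3-mini, a so-called lightweight LLM, on a CPU-only laptop, extracting the keywords takes anywhere from 1 minute 30 seconds to 2 or 3 minutes, and it fails fairly often (pydantic raises a validation error).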
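One way to live with the intermittent validation errors is to wrap the call in a small retry loop that also records how long it took. The following is a minimal sketch under that assumption; invoke_with_retry and max_attempts are names introduced here, and flaky_invoke is only a stand-in for the real chain so that the example runs without a model.

```python
import time
from typing import Any, Callable, Dict, Tuple


def invoke_with_retry(
    invoke: Callable[[Dict[str, Any]], Any],
    payload: Dict[str, Any],
    max_attempts: int = 3,
) -> Tuple[Any, int, float]:
    """Call invoke(payload) until it succeeds or max_attempts is reached.

    Returns (result, attempts_used, elapsed_seconds). If every attempt
    fails, the last exception is re-raised.
    """
    start = time.perf_counter()
    for attempt in range(1, max_attempts + 1):
        try:
            result = invoke(payload)
            return result, attempt, time.perf_counter() - start
        except Exception:
            # Broad on purpose for this sketch; in real use, narrow this
            # to the validation error you actually observe.
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


# Stand-in for tagging_chain.invoke: fails twice, then succeeds.
calls = {"n": 0}


def flaky_invoke(payload: Dict[str, Any]) -> Dict[str, Any]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("simulated validation error")
    return {"Keyword": ["OPAC"], "input_length": len(payload["input"])}


result, attempts, elapsed = invoke_with_retry(flaky_invoke, {"input": "sample text"})
print(result, attempts)
```

With the chain from this article, the call would be invoke_with_retry(tagging_chain.invoke, {"input": inp}). Because the model is configured with temperature=0, a retry on identical input may well reproduce the same failure, and each attempt costs another minute or more on this hardware.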
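If all you need is plain keyword extraction, TF-IDF, BM25, pke (Python Keyphrase Extraction), or spaCy would probably be faster and easier.
That said, if inference were fast enough, the model looks able to extract keywords in a form suited to the purpose, depending on the prompt, and that seems to be where the advantage lies.
Incidentally, I also tried gemma:2b in place of Phi3-mini, but it was unusable for this example.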
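As a point of comparison for the classical approaches mentioned above, here is a minimal pure-Python TF-IDF sketch. It assumes the documents are already tokenized; Japanese text needs a tokenizer first, and that step is out of scope here. The function name tfidf_keywords and the tiny corpus are made up for illustration and are not the output of any real tokenizer.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(docs: List[List[str]], doc_index: int, top_k: int = 3) -> List[str]:
    """Rank the tokens of docs[doc_index] by TF-IDF against the whole corpus."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores: Dict[str, float] = {
        token: (count / total) * math.log(n_docs / df[token])
        for token, count in tf.items()
    }
    # Sort by score (descending), then by token so ties are deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


# Hypothetical pre-tokenized corpus of three short documents.
docs = [
    ["OPAC", "検索", "システム", "LLM", "検索"],
    ["留学", "帰国", "苦悩", "システム"],
    ["LLM", "推論", "速度"],
]
print(tfidf_keywords(docs, doc_index=0, top_k=2))
```

Tokens that appear in several documents (here システム and LLM) are pushed down the ranking, which is the behavior that separates TF-IDF from a plain frequency count. Unlike the LLM approach, it cannot be steered by a prompt.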

Applications

Besides its original purpose of tagging, I think this could be useful when you want to feed keywords into something like a library search system (OPAC).
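To make that idea concrete, the extracted keywords could be joined into a search URL. This is only a sketch: the endpoint https://opac.example.org/search and the q parameter are hypothetical placeholders, and a real OPAC will have its own query syntax.

```python
from typing import List
from urllib.parse import urlencode


def build_opac_query(
    keywords: List[str],
    base_url: str = "https://opac.example.org/search",  # hypothetical endpoint
) -> str:
    """Join non-empty keywords with spaces and URL-encode them as one query."""
    query = " ".join(k.strip() for k in keywords if k.strip())
    # "q" is an assumed parameter name; urlencode percent-encodes non-ASCII
    # text as UTF-8 and turns spaces into "+".
    return f"{base_url}?{urlencode({'q': query})}"


print(build_opac_query(["OPAC", "LLM"]))
print(build_opac_query(["検索", "システム"]))
```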
