Trying things out with an open language model in LangChain (2) ~LangChain × open language model × GPU~

Last updated at 2023-05-09, posted at 2023-05-08

Background

Previously, I plugged GPT4All, an open language model, into LangChain and got it running.
Note: the language model used this time is not GPT4All.

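Inference was too slow, so I wanted to use my local GPU. This article summarizes how I worked out how to do that.
It assumes that a GPU is already set up and usable.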
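
As a quick way to confirm that assumption, the following minimal sketch lists the CUDA devices that PyTorch can see. The helper name `list_cuda_devices` is made up for illustration, and the import is guarded so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # lets the helper run even where PyTorch is not installed
    torch = None


def list_cuda_devices():
    """Return the names of the CUDA devices PyTorch can see (empty if none)."""
    if torch is None or not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


devices = list_cuda_devices()
print(devices if devices else "No CUDA device visible to PyTorch")
```

An empty result usually points to a driver or installation problem rather than to anything in LangChain.
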
If you are on Windows, please refer to the following article.

Finished code

Here is the source code I ended up with.

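from langchain import HuggingFacePipeline, PromptTemplate, LLMChain
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Check for a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n!!! current device is {device} !!!\n")

# Download the model
model_id = "inu-ai/dolly-japanese-gpt-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# LLMs: use the model above from LangChain
task = "text-generation"
pipe = pipeline(
    task,
    model=model,
    tokenizer=tokenizer,
    device=0,            # use the GPU (equivalent to cuda:0)
    framework='pt',      # load the model with PyTorch
    max_new_tokens=32,
    temperature=0.1,
)

llm = HuggingFacePipeline(pipeline=pipe)

# Prompts: build the prompt (raw string, so each \n stays a literal backslash + n)
# The Japanese text says: "Below is a combination of an instruction describing a task
# and an input with context. Write a response that appropriately satisfies the request."
# 指示 = instruction, 入力 = input, 応答 = response
template = r"<s>\n以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n入力:\n{question}\n[SEP]\n応答:\n"
prompt = PromptTemplate(template=template, input_variables=["instruction", "question"])

# Chains: make the llm usable
llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

# Give an instruction and ask a question
# instruction: "You are an assistant that responds to user input. Avoid repeating sentences and answer with a concise summary."
instruction = "あなたはユーザの入力に応答するアシスタントです。文章の繰り返しを避け、簡潔に要約して回答してください。"
# question: "What is Tokyo?"
question = "東京ってなに？"

generated_text = llm_chain.run({"instruction":instruction, "question":question})
print(generated_text)

### Output ###
東京は、日本の首都であり、日本の政治、経済、文化の中心地です。\n
# In English: "Tokyo is the capital of Japan and the center of Japan's politics, economy, and culture."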
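
To see what the model actually receives, here is a small sketch that fills the template from the script above with plain `str.format`. As far as I know, LangChain's `PromptTemplate` uses the same Python format-string syntax by default, but treat that as an assumption. The instruction text is a shortened, made-up example.

```python
# Same template as in the script above (raw string, so \n is a backslash followed by n).
template = r"<s>\n以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n入力:\n{question}\n[SEP]\n応答:\n"

# Replace the two placeholders, as the chain does before calling the model.
filled = template.format(
    instruction="簡潔に回答してください。",  # "Please answer concisely." (illustrative)
    question="東京ってなに？",               # "What is Tokyo?"
)
print(filled)
```

Because the template is a raw string, the printed prompt is a single line that contains the two-character sequence `\n` rather than real line breaks.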

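How I investigated GPU usage

While looking into this, I found the following pull request.

Roughly speaking, it said: "local GPUs can now be used when loading a language model with HuggingFacePipeline."
That made me think a local GPU should be usable, so I went through the LangChain documentation and source code.

There, I found the following example in the source code.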

langchain/llms/huggingface_pipeline.py:36~

from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
)
hf = HuggingFacePipeline(pipeline=pipe)

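This source code downloads the tokenizer and the model from the Hugging Face Hub and
builds a pipeline from them.

So it looks like the following steps should work:

Upload the language model you want to use to the Hugging Face Hub
- (This time I downloaded an open model, so I did not upload anything.)
Load the uploaded model with Transformers' AutoModelForCausalLM
Transfer it to the GPU
Plug it into LangChain's HuggingFacePipeline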

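Steps

1. Create a Hugging Face account (optional)

Hugging Face is a community (?) where all sorts of people upload datasets, models, and more.
- Many pre-trained models derived from the Transformer, such as GPT and LLaMA, are published there
- It was fun just to browse

Go to the Hugging Face site, click Sign Up, and register as a user.
Once you verify your email address, the account is created.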

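2. Find a model to download

Model used

After digging around the Hugging Face Hub, I settled on the following model for now.

About dolly-japanese-gpt-1b

It is a conversational AI built on a Japanese GPT-2 model with 1.3B parameters. It needs 7 GB of VRAM or 7 GB of RAM, with which it should run without problems.

Main parameters

Model type: GPT2 (a GPT2 derivative)
Number of parameters: 1.3 billion (slightly fewer than the 1.5 billion of the largest GPT2)
Number of heads: 16
Number of layers: 24
Vocabulary size: 45000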
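
As a rough sanity check on the 7 GB figure, the arithmetic below estimates the memory needed just to hold 1.3 billion weights. The precision the checkpoint is actually loaded in is an assumption here, and activations and runtime overhead are not counted, so this is a back-of-the-envelope sketch only.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 1.3e9  # parameter count from the list above

fp32 = weight_memory_gib(NUM_PARAMS, 4)  # 32-bit floats, 4 bytes each
fp16 = weight_memory_gib(NUM_PARAMS, 2)  # 16-bit floats, 2 bytes each
print(f"float32 weights: {fp32:.1f} GiB")  # about 4.8 GiB
print(f"float16 weights: {fp16:.1f} GiB")  # about 2.4 GiB
```

If the weights are held in float32, a bit under 5 GiB plus working memory lands in the same ballpark as the stated 7 GB.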

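3. Download the model and load it in a GPU environment

Constraints of the GPU environment? (Is only PyTorch supported?)

This seems to be a constraint of the transformers Python package.
It supports both PyTorch and TensorFlow, but it looks like only the PyTorch side supports local GPUs?
(I have not read the source code carefully, so please let me know if you know the answer.)

Flow for loading on the GPU

Download the model from the Hugging Face Hub
Create the tokenizer
Create the model
Transfer it to the GPU with model.to('cuda')

Install the packages needed to run the source code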

pip install transformers==4.28.1
pip install tokenizers==0.13.3
pip install torch
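
After installing, a short standard-library check like the one below can confirm which versions ended up in the environment. The helper name `installed_versions` is made up for illustration; `langchain` is included only because the later scripts import it, and the article does not pin its version.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["transformers", "tokenizers", "torch", "langchain"]
).items():
    print(f"{name}: {version or 'not installed'}")
```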

Run the test program

tests/huggingface_test.py

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Check for a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n!!! current device is {device} !!!\n")

# Download the model, then create the tokenizer and the model
model_id = "inu-ai/dolly-japanese-gpt-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

Checking the run

If no error is displayed, you are good to go.

python tests/huggingface_test.py 

!!! current device is cuda !!!

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

4. Run on the GPU from LangChain

An error occurred at inference time

The cause was that the model was loaded on the GPU while the input data was placed on the CPU.
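
The kind of mismatch described above can be reproduced in plain PyTorch, independent of LangChain. The sketch below is a generic illustration, not the article's actual traceback: it puts a small layer on the GPU, feeds it a CPU tensor, and then retries after moving the input. The function name is hypothetical, and the demo is skipped when PyTorch or a CUDA device is missing.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None


def demo_device_mismatch():
    """Return a short description of what happened."""
    if torch is None or not torch.cuda.is_available():
        return "skipped: no CUDA device available"

    layer = torch.nn.Linear(4, 2).to("cuda")  # weights live on the GPU
    x = torch.ones(1, 4)                      # input is created on the CPU

    try:
        layer(x)  # GPU weights with a CPU input
    except RuntimeError as err:
        first = f"CPU input failed with {type(err).__name__}"
    else:
        first = "CPU input raised no error"

    y = layer(x.to("cuda"))  # move the input to the same device as the weights
    return f"{first}; after x.to('cuda') the output is on {y.device}"


print(demo_device_mismatch())
```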

Code that raised the error

from langchain import HuggingFacePipeline, PromptTemplate, LLMChain
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Check for a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n!!! current device is {device} !!!\n")

# Download the model
model_id = "inu-ai/dolly-japanese-gpt-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# LLMs: use the model above from LangChain
# (no device argument here, which is what led to the error)
task = "text-generation"
pipe = pipeline(
    task,
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=32,
    temperature=0.1,
)

llm = HuggingFacePipeline(pipeline=pipe)

# Prompts: build the prompt (raw string, so each \n stays a literal backslash + n)
# 指示 = instruction, 入力 = input, 応答 = response
template = r"<s>\n以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n入力:\n{question}\n[SEP]\n応答:\n"
prompt = PromptTemplate(template=template, input_variables=["instruction", "question"])

# Chains: make the llm usable
llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

# Give an instruction and ask a question
# instruction: "You are an assistant that responds to user input. Avoid repeating sentences and answer with a concise summary."
instruction = "あなたはユーザの入力に応答するアシスタントです。文章の繰り返しを避け、簡潔に要約して回答してください。"
# question: "What is Tokyo?"
question = "東京ってなに？"

generated_text = llm_chain.run({"instruction":instruction, "question":question})
print(generated_text)

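Solution: explicitly tell pipeline() to use the GPU

Reading the Transformers source code, I found the following.

This is the source code related to pipeline(), which is used when loading the model for LangChain.

Arguments for initializing the pipeline
- framework:
  - Selects PyTorch or TensorFlow
- device:
  - -1 seems to use the CPU, and 0 or greater the GPU
  - A device name such as "cuda:0" can also be given directly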

transformers/pipelines/base.py:690~

PIPELINE_INIT_ARGS = r"""
    Arguments:
        model ([`PreTrainedModel`] or [`TFPreTrainedModel`]):
            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from
            [`PreTrainedModel`] for PyTorch and [`TFPreTrainedModel`] for TensorFlow.
        tokenizer ([`PreTrainedTokenizer`]):
            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from
            [`PreTrainedTokenizer`].
        modelcard (`str` or [`ModelCard`], *optional*):
            Model card attributed to the model for this pipeline.
        ##### Here #####
        framework (`str`, *optional*):
            The framework to use, either `"pt"` for PyTorch or `"tf"` for TensorFlow. The specified framework must be
            installed.
            If no framework is specified, will default to the one currently installed. If no framework is specified and
            both frameworks are installed, will default to the framework of the `model`, or to PyTorch if no model is
            provided.
        task (`str`, defaults to `""`):
            A task-identifier for the pipeline.
        num_workers (`int`, *optional*, defaults to 8):
            When the pipeline will use *DataLoader* (when passing a dataset, on GPU for a Pytorch model), the number of
            workers to be used.
        batch_size (`int`, *optional*, defaults to 1):
            When the pipeline will use *DataLoader* (when passing a dataset, on GPU for a Pytorch model), the size of
            the batch to use, for inference this is not always beneficial, please read [Batching with
            pipelines](https://huggingface.co/transformers/main_classes/pipelines.html#pipeline-batching) .
        args_parser ([`~pipelines.ArgumentHandler`], *optional*):
            Reference to the object in charge of parsing supplied pipeline parameters.
        ##### Here #####
        device (`int`, *optional*, defaults to -1):
            Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, a positive will run the model on
            the associated CUDA device id. You can pass native `torch.device` or a `str` too.
        binary_output (`bool`, *optional*, defaults to `False`):
            Flag indicating if the output the pipeline should happen in a binary format (i.e., pickle) or as raw text.
"""

Reading further through this file, it appears that the GPU is used when the framework is PyTorch and device=cuda:x.

if self.framework == "pt" and device is not None and not (isinstance(device, int) and device < 0):
    self.model.to(device)
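
To make the condition quoted above easier to read, the sketch below copies it into a standalone function and evaluates it for a few `device` values. The function name `moves_model_to_device` is made up for illustration; only the boolean expression comes from the excerpt.

```python
def moves_model_to_device(framework, device):
    """Mirror of the quoted condition: True when model.to(device) would be called."""
    return (
        framework == "pt"
        and device is not None
        and not (isinstance(device, int) and device < 0)
    )


for fw, dev in [("pt", -1), ("pt", 0), ("pt", "cuda:0"), ("pt", None), ("tf", 0)]:
    print(f"framework={fw!r:5} device={dev!r:9} -> {moves_model_to_device(fw, dev)}")
```

With the documented default of -1 the condition is false, which would be consistent with the pipeline treating its inputs as CPU data in the failing script.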

With that in mind, I fixed the earlier source code.

Fixed source code

from langchain import HuggingFacePipeline, PromptTemplate, LLMChain
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Check for a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n!!! current device is {device} !!!\n")

# Download the model
model_id = "inu-ai/dolly-japanese-gpt-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

# LLMs: use the model above from LangChain
task = "text-generation"
pipe = pipeline(
    task,
    model=model,
    tokenizer=tokenizer,
    device=0,            # use the GPU (equivalent to cuda:0)
    framework='pt',      # load the model with PyTorch
    max_new_tokens=32,
    temperature=0.1,
)

llm = HuggingFacePipeline(pipeline=pipe)

# Prompts: build the prompt (raw string, so each \n stays a literal backslash + n)
# 指示 = instruction, 入力 = input, 応答 = response
template = r"<s>\n以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n入力:\n{question}\n[SEP]\n応答:\n"
prompt = PromptTemplate(template=template, input_variables=["instruction", "question"])

# Chains: make the llm usable
llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

# Give an instruction and ask a question
# instruction: "You are an assistant that responds to user input. Avoid repeating sentences and answer with a concise summary."
instruction = "あなたはユーザの入力に応答するアシスタントです。文章の繰り返しを避け、簡潔に要約して回答してください。"
# question: "What is Tokyo?"
question = "東京ってなに？"

generated_text = llm_chain.run({"instruction":instruction, "question":question})
print(generated_text)

### Output ###
東京は、日本の首都であり、日本の政治、経済、文化の中心地です。\n
# In English: "Tokyo is the capital of Japan and the center of Japan's politics, economy, and culture."

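The output came out successfully.
The inference times were as follows. I expect the gap to widen further depending on the complexity of the model and the length of the input and output.

With CPU: 7.2s
With GPU: 0.7s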
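
The article does not say how these times were measured. One simple way to take such a measurement is a wall-clock helper like the sketch below; `time_call` is a hypothetical name, and the `sum` call is only a stand-in workload so the snippet runs without a model.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn several times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload. With the article's chain this would look something like:
#   time_call(llm_chain.run, {"instruction": instruction, "question": question})
seconds, value = time_call(sum, range(1_000_000))
print(f"best of 3: {seconds:.4f}s")
```

Taking the best of a few runs reduces the influence of one-off costs such as the first CUDA call, which can otherwise dominate a single measurement.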

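Summary

I was able to set up an environment that combines LangChain, an open language model, and a local GPU.
I had been experimenting in Jupyter Notebook, but slow inference was a constant pain.
Now that inference is faster, development should go more smoothly.

From next time on, I plan to cover the following!

Chat
- I want the model to converse with the user while keeping the conversation history
Embedding
- I want the model to answer based on information in local files.
- The idea is something like an internal chatbot that reads in-house documents
- I implemented this before with the OpenAI API, but it ended up fairly complicated.
- I hope LangChain can make it simpler.
Agent
- I want to implement ReAct, which was my original goal.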
