Yet Another Rehash: Running the University of Tokyo Matsuo Lab's weblab-10b-instruction-sft on Databricks

Last updated at 2023-08-21, posted at 2023-08-18

I am writing this while I am off work.

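The Matsuo Lab at the University of Tokyo has released the large language model weblab-10b.

Several well-known experts tried it right away (I am always indebted to them),
and although I have lost count of how many write-ups came before this one, I tried it on Databricks as well.
(Everything except the steps that use dbutils should also work outside Databricks.)
Simply running the model would not be very interesting, so I also try a small improvement: making inference more efficient.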

Environment

Everything runs in a notebook on Databricks (on AWS). The Databricks Runtime (DBR) is 13.2 ML.
The node type is g5.16xlarge.

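Preparation

Download the model from Hugging Face.
Because I plan to reuse it, I first save it to the cluster's local storage and then copy it to a Unity Catalog volume.

Just in case, update transformers to the latest version: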

%pip install -U -qq transformers accelerate

dbutils.library.restartPython()

Download the snapshot to local storage.

import os
from huggingface_hub import snapshot_download

model = "matsuo-lab/weblab-10b-instruction-sft"
local_dir = "/tmp/matsuo-lab/weblab-10b-instruction-sft"

snapshot_location = snapshot_download(
    repo_id=model,
    local_dir=local_dir,
    local_dir_use_symlinks=False,
)

Then copy it to the Unity Catalog volume.

UC_VOLUME = "/Volumes/<Unity Catalog volume path>"

dbutils.fs.cp(
    f"file:{local_dir}", 
    f"{UC_VOLUME}/matsuo-lab/weblab-10b-instruction-sft", 
    recurse=True,
)

The model is now stored.

Running the model

I run it with almost the same code as in the linked article.

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

UC_VOLUME = "/Volumes/<Unity Catalog volume path>"
# Same location the snapshot was copied to with dbutils.fs.cp above
tokenizer_path = f"{UC_VOLUME}/matsuo-lab/weblab-10b-instruction-sft"
model_path = f"{UC_VOLUME}/matsuo-lab/weblab-10b-instruction-sft"

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)


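# Instruction (Japanese): "Please explain large language models."
text = "大規模言語模型について説明してください。".replace("模型", "モデル")
# Template (Japanese): "Below is an instruction that describes a task.
# Write a response that appropriately satisfies the request."
text = f'以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{text}\n\n### 応答:'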
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        do_sample=True,
        temperature=0.5,
        top_p=0.9
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)

Result

以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
大規模言語モデルについて説明してください。

### 応答:大規模言語モデルは、大規模なデータセットを使用して、言語の構造とパターンをよりよく理解することを目的とした、データ駆動型の言語モデルです。大規模言語モデルでは、大量のデータを処理して、言語のパター

Next, I wrap this in a function and run other instructions.

# Wrap generation in a function
def generate_batch(instruction: str, max_tokens: int = 100):

    prompt = f"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 応答:"

    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.5,
            top_p=0.9,
        )

    output = tokenizer.decode(output_ids.tolist()[0])
    return output

# Instruction (Japanese): "Please tell me how to beat the summer heat."
generate_batch("夏の暑さをしのぐ方法を教えてください。")

Result

以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
夏の暑さをしのぐ方法を教えてください。

### 応答:夏の暑さをしのぐには、水泳、サウナ、日光浴、マスクをしたり、運動をしたり、冷たい飲み物を飲んだり、冷たい食べ物を食べたり、熱い食べ物を食べたり、アイスクリームを食べたり
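
As the results above show, the decoded text contains the prompt as well as the response. A minimal sketch of one way to keep only the response follows. The prompt template is copied from the code above; `build_prompt` and `extract_response` are hypothetical helper names introduced here for illustration.

```python
# Prompt template copied from the article's code.
PROMPT_TEMPLATE = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    "\n\n### 指示:\n{instruction}\n\n### 応答:"
)
RESPONSE_MARKER = "### 応答:"


def build_prompt(instruction: str) -> str:
    """Fill an instruction into the template used throughout this article."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker (hypothetical helper)."""
    _, marker, tail = decoded.partition(RESPONSE_MARKER)
    # When the marker is absent, fall back to the whole decoded string.
    return tail.strip() if marker else decoded.strip()
```

Splitting on the first marker assumes the instruction itself does not contain "### 応答:".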

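After this I also tried tasks such as summarization, but they failed with CUDA out-of-memory errors. On the A10G (24 GB of VRAM) in this instance, memory is very tight and longer inputs are hard to run.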
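
A rough estimate helps explain the out-of-memory errors. The sketch below computes the size of the weights alone. The figure of 10 billion parameters is an assumption taken from the "10b" in the model name, and activations and the KV cache need memory on top of this.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: roughly 10 billion parameters, inferred from the model name.
ASSUMED_PARAMS = 10e9

# float16 stores each parameter in 2 bytes.
fp16_gib = weight_memory_gib(ASSUMED_PARAMS, bytes_per_param=2)
print(f"float16 weights: about {fp16_gib:.1f} GiB")  # about 18.6 GiB
```

With the weights taking this much, little headroom remains for long prompts such as a summarization input.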

Also, with the settings above (max_new_tokens=100), I estimate that one generation takes somewhere between 10 and 20 seconds.
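
To replace the estimate with a measurement, a small timing wrapper is enough. This is a minimal sketch; `timed` is a hypothetical helper, and the workload shown is a stand-in so the code runs anywhere.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; on the cluster, pass generate_batch and an instruction instead.
result, seconds = timed(sum, range(1_000_000))
print(f"{seconds:.3f} s")
```

Because sampling is enabled, output length and therefore runtime vary between calls, so averaging a few runs gives a steadier number.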

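Reducing memory use and speeding up inference

VRAM is very tight, so I quantize the model, accepting that it becomes slightly less capable in exchange. I also speed up inference.

Quantizing within Hugging Face transformers would also work, but
I use CTranslate2, which is convenient on Databricks and a personal favorite.
For an introduction to CTranslate2, an article by one of the experts explains it clearly.

Converting the model

Install the required modules.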

%pip install -U -qq ctranslate2 transformers accelerate

dbutils.library.restartPython()

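After setting the paths, run ct2-transformers-converter.
Specifying int8_bfloat16 quantizes the weights to 8 bits. Adjust the quantization setting to suit your environment.
Note: in hindsight, int8_float16 might have been the better choice.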
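
Which setting suits a given GPU depends on what the device supports. The sketch below picks the first supported type from a preference list. `pick_compute_type` is a hypothetical helper, and the preference order is an assumption for illustration rather than a recommendation from CTranslate2.

```python
from typing import Iterable, Optional, Sequence

# Assumed preference order, for illustration only.
DEFAULT_PREFERENCE = ("int8_float16", "int8_bfloat16", "int8", "float16", "float32")


def pick_compute_type(
    supported: Iterable[str],
    preference: Sequence[str] = DEFAULT_PREFERENCE,
) -> Optional[str]:
    """Return the first preferred type that the device reports as supported."""
    supported_set = set(supported)
    for candidate in preference:
        if candidate in supported_set:
            return candidate
    return None


# On a real cluster the set could come from
# ctranslate2.get_supported_compute_types("cuda"); a hand-written set is used here.
example_supported = {"float32", "float16", "int8", "int8_float16"}
print(pick_compute_type(example_supported))  # int8_float16
```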

UC_VOLUME = "/Volumes/<Unity Catalog volume path>"

# Same location the snapshot was copied to during preparation
src_path = f"{UC_VOLUME}/matsuo-lab/weblab-10b-instruction-sft"
output_dir = "/tmp/llm/weblab-10b"  # temporary storage

!ct2-transformers-converter --model {src_path} --copy_files tokenizer.json tokenizer_config.json special_tokens_map.json --output_dir {output_dir} --quantization int8_bfloat16 --low_cpu_mem_usage

The --copy_files option copies three files (tokenizer.json, tokenizer_config.json, and special_tokens_map.json) so that the tokenizer can be loaded from local files.

If all goes well, there should be six files in the location given by output_dir.
To persist them, copy them to a Unity Catalog volume.

UC_CT2_VOLUME = "/Volumes/<Unity Catalog volume path for the CT2-converted model>"
dbutils.fs.cp("file:"+output_dir, f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft", recurse=True)
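
The presence of the expected files can be checked with a few lines of code. The sketch below lists the expected names that are missing from a directory. `missing_files` is a hypothetical helper; the three tokenizer file names come from the --copy_files option above, while the names of the converter's own outputs may differ between CTranslate2 versions and are left to the caller.

```python
import os
from typing import Iterable, List

# Files copied by --copy_files in the conversion command above.
COPIED_FILES = ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json")


def missing_files(directory: str, expected: Iterable[str]) -> List[str]:
    """Return the expected file names that are absent from directory, sorted."""
    present = set(os.listdir(directory))
    return sorted(name for name in expected if name not in present)
```

For example, `missing_files(output_dir, COPIED_FILES + ("config.json",))` returns an empty list when the tokenizer files and config.json are all in place.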

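Normally the conversion would be complete at this point, but in my environment every parameter in the generated config.json came out as null. I first suspected a transformers or CTranslate2 version problem, but the cause appears to be that special_tokens_map.json is empty.
Left as is, loading the model raises an error, so I recreate config.json by hand.
Fixing special_tokens_map.json before the conversion would probably avoid this extra step (untested).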

# Back up the original, just in case
dbutils.fs.cp(
    f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft/config.json",
    f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft/config_bak.json",
)

# Set bos_token, eos_token, and the other values by hand
conf = '{\n  "bos_token": "<|endoftext|>",\n  "eos_token": "<|endoftext|>",\n  "layer_norm_epsilon": 1e-05,\n  "unk_token": "<|endoftext|>"\n}\n'
dbutils.fs.put(f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft/config.json", contents=conf, overwrite=True)

Update special_tokens_map.json as well.


# Back up the original, just in case
dbutils.fs.cp(
    f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft/special_tokens_map.json",
    f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft/special_tokens_map_bak.json",
)

# Set the values by hand
conf = """
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|padding|>",
  "unk_token": "<|endoftext|>"
}
"""
dbutils.fs.put(
    # Same directory as the backup above (the matsuo-lab/ segment was missing)
    f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft/special_tokens_map.json",
    contents=conf,
    overwrite=True,
)
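
Hand-written JSON strings are easy to get subtly wrong. As an alternative sketch, the same content can be built from a dictionary with the standard json module and verified by a round trip before it is written. The token values are copied from the code above; the variable names are introduced here for illustration.

```python
import json

# Token values copied from the manual special_tokens_map.json above.
special_tokens = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "<|padding|>",
    "unk_token": "<|endoftext|>",
}

contents = json.dumps(special_tokens, indent=2, ensure_ascii=False) + "\n"

# Round-trip check: fails loudly if the text is not valid JSON.
assert json.loads(contents) == special_tokens
print(contents)
```

The resulting `contents` string could then be passed to dbutils.fs.put in place of the hand-written one.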

Inference, again

Run inference with CT2.

import ctranslate2
import transformers
import torch

model_path = f"{UC_CT2_VOLUME}/matsuo-lab/weblab-10b-instruction-sft"

# Prepare the generator and tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = ctranslate2.Generator(model_path, device=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)

# Prepare the prompt (same instruction and template as before)
text = "大規模言語モデルについて説明してください。"
text = f"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{text}\n\n### 応答:"

# Run inference
tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode(text, add_special_tokens=False)
)
results = generator.generate_batch(
    [tokens],
    max_length=80,
    sampling_topk=10,
    sampling_temperature=0.1,
    repetition_penalty=1.1,
    include_prompt_in_result=False,
    cache_static_prompt=True,
    return_scores=True,
)
text = tokenizer.decode(results[0].sequences_ids[0])
print(text)

Because include_prompt_in_result=False is specified, only the response is printed.

Result

大規模言語モデルは、膨大な数の単語を含むテキスト・ベースのコーパスから学習するように設計された機械学習アプリケーションである。このタイプのモデ

Judging from this result alone, accuracy may not have dropped much.

I wrap this in a function too and try a few more prompts, with a larger max_length.

def generate_batch(instruction:str) -> str:

    prompt = f"以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{instruction}\n\n### 応答:"

    # Run inference
    tokens = tokenizer.convert_ids_to_tokens(
        tokenizer.encode(prompt, add_special_tokens=False)
    )
    results = generator.generate_batch(
        [tokens],
        max_length=256,
        sampling_topk=10,
        sampling_temperature=0.5,
        repetition_penalty=1.1,
        include_prompt_in_result=False,
        cache_static_prompt=True,
        return_scores=True,
    )

    text = tokenizer.decode(results[0].sequences_ids[0])
    return text

Japanese instruction

# Instruction (Japanese): "Please tell me how to beat the summer heat."
print(generate_batch("夏の暑さをしのぐ方法を教えてください。"))

Result

暑さをしのぐには、水分と冷たい飲み物やソーダを取ることが重要です。まずはアイスティーかホットレモンでクールダウンしてから、次に氷入りの冷たいドリンクが効果的な方法です。

English instruction

# Borrowing an example from one of the experts
print(generate_batch("Tell me 5 key points to success the business"))

Result

1. Know your target market: You must have a clear understanding of who your customers are and what they want to buy from you.
2. Understand your product or service’s value proposition: This is the most important part of business planning, as it helps in determining how well your product or service meets the needs and wants of the customer.
3. Plan for peak demand: In order to ensure that your business can handle peak sales periods, you need to know when and where people will be shopping.
4. Keep track of competitors' strategies: Make sure that you know all the details about your competition so you can anticipate their moves.
5. Be consistent with your marketing efforts: If you want to keep customers coming back to your store, make sure you consistently provide them with the information they need.

Japanese summarization

# Summarize part of the Wikipedia article on Oda Nobunaga (in about 40 characters)
# https://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7

print(generate_batch("""
以下の文章から、織田信長がどういった人物だったかを40文字程度で要約しなさい。

織田 信長（おだ のぶなが）は、日本の戦国時代から安土桃山時代にかけての武将・大名。戦国の三英傑の一人。

尾張国（現在の愛知県）出身。織田信秀の嫡男。家督争いの混乱を収めた後に、桶狭間の戦いで今川義元を討ち取り、勢力を拡大した。足利義昭を奉じて上洛し、後には義昭を追放することで、畿内を中心に独自の中央政権（「織田政権」[注釈 4]）を確立して天下人となった戦国時代を代表する英雄である[2]。しかし、天正10年6月2日（1582年6月21日）、家臣・明智光秀に謀反を起こされ、本能寺で自害した。

これまで信長の政権は、豊臣秀吉による豊臣政権、徳川家康が開いた江戸幕府への流れをつくった画期的なもので、その政治手法も革新的なものであるとみなされてきた[3]。しかし、近年の歴史学界ではその政策の前時代性が指摘されるようになり、しばしば「中世社会の最終段階」とも評され[3]、その革新性を否定する研究が主流となっている[4][5]。      
"""))

Result

織田信長は、尾張国の出身で、家督争いに勝利した後、天下統一を目指して足利義昭と共同体を設立し、自らが天下人になることを決意して、本能寺で自害した戦国時代を代表する英雄。

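Memory usage and inference time

VRAM usage was a little over 10 GB, roughly half of what the unquantized model needed.
Inference time also dropped to around 5 seconds, depending on the number of tokens.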
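
The article does not say how the VRAM figure was measured. One option, sketched below, is to query nvidia-smi and parse its CSV output. `parse_memory_csv` and `query_gpu_memory` are hypothetical helper names, and the sample input is illustrative, not a measurement from this experiment.

```python
import subprocess
from typing import List, Tuple


def parse_memory_csv(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for used and total memory; requires an NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


# Illustrative input only, so the sketch runs without a GPU.
print(parse_memory_csv("1024, 16384\n"))  # [(1024, 16384)]
```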

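With this conversion, inference can run on a g4dn.xlarge. Cost performance is justice.

Summary

Open-source LLMs with strong Japanese ability, such as those from LINE and Stability AI, have been appearing one after another over the past week or two.
weblab-10b also seems to be a very capable model.
It is honestly a pity that commercial use is not currently permitted, but I look forward to future developments.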
