Building an interactive Llama 2 chatbot with gradio ChatInterface and llama-cpp-python

Last updated at 2023-08-02, posted at 2023-08-01

Introduction

This article builds a chatbot that can converse with Llama 2, using gradio's ChatInterface and llama-cpp-python.

Environment

ThinkPad X1 Carbon (CPU: Core i7 8th Gen, memory: 16 GB)
Windows 10 Home
Visual Studio 2017
cmake 3.27
Python 3.8.10
llama-cpp-python 0.1.77
gradio 3.39.0

Setup

Installing gradio

This article uses ChatInterface. According to the changelog, ChatInterface appears to be available from version 3.37 onward, so I upgraded gradio to the latest version. As a result, 3.39 was installed.

pip install -U gradio

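Installing Llama 2

Since the model runs from Python on a Windows CPU, I install llama-cpp-python. Following the Windows remarks on GitHub, set the environment variables and install with pip. cmake and a compiler were already installed on my machine, so the installation went smoothly; if you do not have them, you will probably need to install cmake and a compiler first.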

$env:CMAKE_ARGS = "-DLLAMA_OPENBLAS=on"
$env:FORCE_CMAKE = 1
pip install llama-cpp-python

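Downloading the Llama 2 parameters

Download the ggml version from Hugging Face. Because this runs on a laptop bought several years ago, I use Llama-2-7B, the smallest Llama 2 model.

Basic structure of `ChatInterface`

ChatInterface is driven by a function that takes the chat message and its history as arguments. Below is a simple example that just returns Yes or No at random whenever a chat message is entered. With ChatInterface, a chat interface can be built in a few lines.
Having the predict function answer with an LLM completes the chatbot.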

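import random
import gradio as gr

def predict(message, history):
    # Ignore the input and answer Yes or No at random
    return random.choice(["Yes", "No"])

# ChatInterface calls predict(message, history) for every user message
gr.ChatInterface(predict).launch()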

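By default, the predict function takes two arguments, message and history. The first argument, message, holds the chat string entered by the user. history receives the past exchanges between the user and the system as a list of the form [['{user_message}', '{system_message}'], ...]. If message and history are used to build a prompt that lays out the history according to a given template, and that prompt is passed to Llama 2, the model returns an answer that takes the chat history into account. This article tries two ways of building the chatbot: one uses the create_chat_completion method, and the other uses the template from the Hugging Face demo (called the demo template here for convenience). The difference in answers between the two templates is also checked.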
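To make the shape of the two arguments concrete, here is a minimal sketch that needs neither gradio nor a model. The reply text and the sample history are made up for illustration; only the `[[user_message, system_message], ...]` layout comes from the description above.

```python
def predict(message, history):
    """Toy reply showing that earlier turns are visible to the function."""
    # history is a list of [user_message, system_message] pairs, oldest first.
    previous_user_messages = [user for user, _ in history]
    turn = len(history) + 1
    if previous_user_messages:
        return (f"Turn {turn}: you said {message!r}; "
                f"before that you said {previous_user_messages[-1]!r}.")
    return f"Turn {turn}: you said {message!r}; this is the first message."

# Simulate what would be passed on the second turn of a conversation.
history = [["What is Python?", "Python is a programming language."]]
print(predict("How many people use it?", history))
```

Passing this function to gr.ChatInterface should behave the same way, with gradio supplying message and history on each turn.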

Using the create_chat_completion method

Like openai, llama-cpp-python provides a chat-oriented method called create_chat_completion. When chat history data, a list of dictionaries with role and content keys, is passed to create_chat_completion, it is converted to

### Human:{user_msg_1}
### Assistant:{system_answer}
### Human:{user_msg_2}
### Assistant:

and that prompt is fed to the create_completion method, which outputs the answer. create_completion is the method that returns Llama 2's answer.
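The conversion described above can be approximated in a few lines. This is only an illustration of the template as this article describes it for llama-cpp-python 0.1.77, not the library's actual source, and other versions may format chats differently. The function name format_chat_prompt is hypothetical, and treating every non-user role as Assistant is an assumption.

```python
def format_chat_prompt(messages):
    """Approximate the '### Human:' / '### Assistant:' prompt layout."""
    parts = []
    for m in messages:
        # Assumption: any role other than "user" is rendered as Assistant.
        speaker = "Human" if m["role"] == "user" else "Assistant"
        parts.append(f"### {speaker}:{m['content']}")
    # The open "### Assistant:" at the end is where the model continues.
    parts.append("### Assistant:")
    return "".join(parts)

# Each content ends with "\n", as in the predict function of this article.
messages = [
    {"role": "user", "content": "What is Python?\n"},
    {"role": "system", "content": "Python is a programming language.\n"},
    {"role": "user", "content": "How many people use it?\n"},
]
prompt = format_chat_prompt(messages)
print(prompt)
```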
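The predict function is built with create_chat_completion. To display the result incrementally, stream=True is passed to create_chat_completion and the answer is returned with yield. The answer is in the content of msg['choices'][0]['delta'] in the code, but the first value returned for an answer contains role rather than content, so only chunks whose keys include content are output to the chat.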

def predict(message, history):
    messages = []
    for human_content, system_content in history:
        message_human = {
            "role":"user",
            "content":human_content+"\n",
        }
        message_system = {
            "role":"system",
            "content":system_content+"\n",
        }
        messages.append(message_human)
        messages.append(message_system)
    message_human = {
        "role":"user",
        "content": message+"\n",
    }
    messages.append(message_human)
    # Generate the answer with Llama 2
    streamer = llama.create_chat_completion(messages, stream=True)

    partial_message = ""
    for msg in streamer:
        message = msg['choices'][0]['delta']
        if 'content' in message:
            partial_message += message['content']
            yield partial_message
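The accumulation loop can be tried without loading a model by feeding it a hand-written stream. In the sketch below, fake_chat_stream and accumulate are hypothetical helpers, and the chunk contents (including the empty final delta) are assumptions that mimic the shape described above: a first chunk with only role, then chunks with content.

```python
def fake_chat_stream():
    """Stand-in for llama.create_chat_completion(messages, stream=True)."""
    # The first chunk carries only the role, with no content key.
    yield {"choices": [{"delta": {"role": "assistant"}}]}
    for piece in ["Python", " is", " a", " language."]:
        yield {"choices": [{"delta": {"content": piece}}]}
    # Assumption: the stream ends with an empty delta.
    yield {"choices": [{"delta": {}}]}

def accumulate(streamer):
    """Yield the growing answer, skipping chunks that have no content."""
    partial_message = ""
    for msg in streamer:
        delta = msg["choices"][0]["delta"]
        if "content" in delta:
            partial_message += delta["content"]
            yield partial_message

for text in accumulate(fake_chat_stream()):
    print(text)
```

Each yielded value is the whole answer so far, which is what lets the chat window redraw the message as it grows.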

Full code

import gradio as gr
from llama_cpp import Llama

model_path = "llama-2-7b-chat.ggmlv3.q4_K_M.bin"
llama = Llama(model_path)

def predict(message, history):
    messages = []
    for human_content, system_content in history:
        message_human = {
            "role":"user",
            "content": human_content+"\n",
        }
        message_system = {
            "role":"system",
            "content": system_content+"\n",
        }
        messages.append(message_human)
        messages.append(message_system)
    message_human = {
        "role":"user",
        "content":message+"\n",
    }
    messages.append(message_human)
    # Generate the answer with Llama 2
    streamer = llama.create_chat_completion(messages, stream=True)

    partial_message = ""
    for msg in streamer:
        message = msg['choices'][0]['delta']
        if 'content' in message:
            partial_message += message['content']
            yield partial_message

gr.ChatInterface(predict).queue().launch()

Without the create_chat_completion method (using the Hugging Face demo template)

The Hugging Face blog introduces the following input template.

<s>[INST] <<SYS>>
{{system_prompt}}

<</SYS>>

{{user_msg_1}} [/INST] {{model_answer_1}} </s><s>[INST] {{ user_msg_2}} [/INST]
…

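The Llama 2 chat demo on Hugging Face also uses the <s>[INST]... template. Following this demo template, the prompt is built from the chat history. The get_prompt function that builds the prompt is taken as written from the Hugging Face demo (copy and paste). The text of DEFAULT_SYSTEM_PROMPT (a boilerplate incantation) is passed as the system_prompt argument of get_prompt.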

DEFAULT_SYSTEM_PROMPT="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

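# Python 3.8 cannot subscript the built-in list/tuple in annotations
from typing import List, Tuple

def get_prompt(message: str, chat_history: List[Tuple[str, str]],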
               system_prompt: str) -> str:
    texts = [f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    # The first user input is _not_ stripped
    do_strip = False
    for user_input, response in chat_history:
        user_input = user_input.strip() if do_strip else user_input
        do_strip = True
        texts.append(f'{user_input} [/INST] {response.strip()} </s><s>[INST] ')
    message = message.strip() if do_strip else message
    texts.append(f'{message} [/INST]')
    return ''.join(texts)
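To see what the function produces, the sketch below repeats get_prompt so that it runs on its own and prints the prompt for a one-turn history. The sample history and the shortened system prompt are made up for readability and are not the values used in this article.

```python
from typing import List, Tuple

def get_prompt(message: str, chat_history: List[Tuple[str, str]],
               system_prompt: str) -> str:
    # Same logic as the function above, repeated so this block is standalone.
    texts = [f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    do_strip = False
    for user_input, response in chat_history:
        user_input = user_input.strip() if do_strip else user_input
        do_strip = True
        texts.append(f'{user_input} [/INST] {response.strip()} </s><s>[INST] ')
    message = message.strip() if do_strip else message
    texts.append(f'{message} [/INST]')
    return ''.join(texts)

# Hypothetical one-turn history; the response has stray spaces on purpose.
history = [("What is Python?", " Python is a programming language. ")]
prompt = get_prompt("How many people use it?", history,
                    "You are a helpful assistant.")
print(prompt)
```

The system prompt appears once at the top, each past turn is closed with </s>, and the prompt ends with an open [/INST] for the model to continue from.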

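This is used to build the predict function. The prompt built with the demo template is passed to create_completion to predict the answer. stream=True is set here as well so that the answer appears incrementally. Each result is a dict, and the generated text is under text in choices.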

def predict(message, history):
    
    prompt = get_prompt(message, history, DEFAULT_SYSTEM_PROMPT)
    # Generate the answer with Llama 2
    streamer = llama.create_completion(prompt, stream=True)

    partial_message = ""
    for msg in streamer:
        message = msg['choices'][0]
        if 'text' in message:
            new_token = message['text']
            if new_token != "<":
                partial_message += new_token
                yield partial_message

Full code

import gradio as gr
from llama_cpp import Llama
from typing import List,Tuple
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""
model_path = "llama-2-7b-chat.ggmlv3.q4_K_M.bin"
llama = Llama(model_path)

def get_prompt(message: str, chat_history: List[Tuple[str, str]],
               system_prompt: str) -> str:
    texts = [f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    # The first user input is _not_ stripped
    do_strip = False
    for user_input, response in chat_history:
        user_input = user_input.strip() if do_strip else user_input
        do_strip = True
        texts.append(f'{user_input} [/INST] {response.strip()} </s><s>[INST] ')
    message = message.strip() if do_strip else message
    texts.append(f'{message} [/INST]')
    return ''.join(texts)

def predict(message, history):
    
    prompt = get_prompt(message, history, DEFAULT_SYSTEM_PROMPT)
    streamer = llama.create_completion(prompt, stream=True)
    partial_message = ""
    for msg in streamer:
        message = msg['choices'][0]
        if 'text' in message:
            new_token = message['text']
            if new_token != "<":
                partial_message += new_token
                yield partial_message
        
gr.ChatInterface(predict).queue().launch()

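Comparing the answers

Two kinds of questions are asked. The first asks what Python is, and the second asks for Python sample code. The first question checks whether the chat history is taken into account. The second looks at the difference between the templates.

Question 1

After asking what Python is, I ask how many people use it, referring to it only by a demonstrative.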

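create_chat_completion method

Demo template

In both cases the second question does not mention Python, yet both recognize "the language" as Python and answer accordingly. There are slight differences, but both appear to answer without problems.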

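Question 2

Ask for a sample program in Python.

create_chat_completion method

Demo template

Here a difference appeared: with the demo template, the model first asked to confirm the user's intent and did not show a code example right away.

The Llama 2 paper devotes considerable space to the safety of answers, so perhaps Llama 2 has been trained well to attend to safety and helpfulness under the contents of DEFAULT_SYSTEM_PROMPT. When answering without DEFAULT_SYSTEM_PROMPT, it wrote code right away (I did not try many times, so this may just be random variation). For simple chat examples, answers came back without problems with both create_chat_completion and the demo template.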

References
