
How DocsBot Works

Posted at 2023-03-25, last updated at 2023-03-25

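Introduction

ChatGPT is getting a lot of attention, but the GPT-3.5 and GPT-4 models used inside ChatGPT cannot be fine-tuned (given additional training). Only the GPT-3 models released in 2020, such as davinci, curie, babbage, and ada, can be fine-tuned (as of March 2023).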

Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models that do not have any instruction following training (like text-davinci-003 does for example).

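Source: https://platform.openai.com/docs/guides/fine-tuning

You could put in the effort to fine-tune a GPT-3 model, but frankly the resulting accuracy is hard to live with. If possible, you would want to benefit from the GPT-3.5 and GPT-4 models instead. I suspect many people are wondering whether there is some way to fine-tune these models.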

What is DocsBot?

DocsBot is a service that lets you use GPT-3.5 and GPT-4 as if they had been fine-tuned. A few usage examples have also been written up in Japanese.

https://note.com/u17da/n/nd5a15910e564
https://twitter.com/masahirochaen/status/1635136368300163079

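Using it only requires registering documents. It supports most common file formats, and it also appears to support web pages and YouTube subtitles. Unlike a ChatGPT prompt¹, there is no strict limit on how much information you can register². Once the documents are registered, sending a question returns an answer that takes the registered documents into account. This is curious, given that no fine-tuning is involved.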

How DocsBot works

Why do the results look as if the model had been fine-tuned when it has not? Here is a brief explanation of the mechanism.

The FAQ on the official site says the following.

How does DocsBot work?

It's a bit technical, but here is a brief overview. We use OpenAI's embedding and ChatGPT APIs, as well as vector databases to store our index. All ingested documentation is cleaned up and divided into smaller chunks and labeled by source. We then use the GPT embedding API to generate a vector representation of each chunk and store it in our vector db index. When a user asks a question, we convert it to an embedding, and perform an advanced semantic search for closest matches to the user's query. Then we take the most relevant chunks, included them as context along with the original question, and use the ChatGPT API to generate a response in markdown format that we then convert to HTML and display to the user.

The description is somewhat technical, but the mechanism boils down to the following very simple steps.

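1. Register the documents in advance.
   - Behind the scenes, they are split into units called chunks (roughly one per topic or section) and stored.
2. When a question arrives, find the chunks most relevant to the question.
3. Send the question together with the chunks found in step 2 to ChatGPT to generate the response.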
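
As a rough illustration of the steps above, here is a minimal sketch in Python. It is not DocsBot's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding API, the in-memory list stands in for a vector database, the sample chunks are made up, and the final call to ChatGPT is left out so that the example runs offline. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding API: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Step 1: embed every chunk once, ahead of time."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Step 2: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Step 3 (input side): place the retrieved chunks in front of the question."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up sample chunks, for illustration only.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "The Power plan allows up to 5k source pages.",
    "Support is available by email on weekdays.",
]
index = build_index(chunks)
question = "How many source pages does the Power plan allow?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this text is what would be sent to the chat model
```

The point of the design is that the model never sees the whole document set. Only the few chunks that score highest against the question are placed in the prompt, which is why no fine-tuning is needed.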

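Steps 1 and 2 are handled on the DocsBot side. Embeddings (distributed representations) are used to find the relevant chunks; this is a technique commonly used in similar-document search. The chunks are presumably sized with constraints in mind, such as the maximum number of tokens that can be sent to ChatGPT. ChatGPT itself is used only in step 3.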
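
How DocsBot actually sizes its chunks is not documented here, so the following is only a sketch of one plausible approach: derive a per-chunk token budget from the model's context window, then greedily pack paragraphs into chunks that stay within it. The reserved token counts, the `top_k` value, and the whitespace-based token estimate are all assumptions made for illustration; a real system would count tokens with the model's tokenizer.

```python
def chunk_budget(context_window, reserved_for_question, reserved_for_answer, top_k):
    """Tokens available per chunk if top_k chunks must fit alongside the Q&A."""
    return (context_window - reserved_for_question - reserved_for_answer) // top_k


def estimate_tokens(text):
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_into_chunks(paragraphs, max_tokens):
    """Greedily pack consecutive paragraphs into chunks of at most max_tokens.

    A single paragraph larger than max_tokens becomes its own oversized chunk;
    this sketch does not split inside a paragraph.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


# 4,096 is the gpt-3.5-turbo limit cited in the article; the rest is hypothetical.
budget = chunk_budget(4096, reserved_for_question=200, reserved_for_answer=500, top_k=4)
print(budget)

paragraphs = ["one two three", "four five", "six seven eight nine", "ten"]
print(split_into_chunks(paragraphs, max_tokens=5))
```

Smaller chunks let more distinct passages fit into one prompt, while larger chunks keep more surrounding context together; the budget above is just one way to trade these off.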

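Conclusion

This was a brief explanation of how DocsBot works. By inserting a small amount of processing before the request goes to ChatGPT, you can use GPT-3.5 and GPT-4 as if they had been fine-tuned. I expect more and more services built on DocsBot to appear in Japan as well.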

¹ gpt-3.5-turbo has a maximum of 4,096 tokens, and gpt-4 has a maximum of 8,192 tokens.
² On the Power plan, up to "5k Source Pages" can be registered (as of March 2023).
