I set up an API server that runs my own large language model

Last updated at 2024-01-05, posted at 2023-12-19

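Introduction

This post briefly walks through how to set up an API server that runs a large language model for your own use. That said, the procedure is almost the same as in my earlier post, "Launching text generation webui on Google Colab and chatting with Llama 2".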

Prerequisites

An account that can use Google Colab
A working curl

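Procedure

As noted above, the procedure is almost the same as in the earlier post, but there are two main differences.

Change the settings when running Launch the web UI
Run the model through OpenAI API-compatible REST calls instead of the Web UI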

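Setup

When running Launch the web UI, change the settings as follows. This example loads a quantized Llama 2 7B model at startup. Choose a different model if you need one. Model size matters when choosing a model; see the Notes section of this post for reference.

Only model_url is listed here, so that it is easy to copy and paste.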

model_url: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

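Do not forget to check the api option. This is what makes API calls possible.

Check the public URL

While the cell runs, URLs are printed in two places, as shown below. The URL shown after INFO:OpenAI-compatible API URL: is the API endpoint. It changes on every launch, so check it each time.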

 * Downloading cloudflared for Linux x86_64...
/usr/local/lib/python3.10/dist-packages/gradio/components/dropdown.py:231: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: llama or set allow_custom_value=True.
  warnings.warn(
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://0aada85cadf60958d7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
2023-12-17 13:25:38 INFO:OpenAI-compatible API URL:

https://worth-past-um-municipal.trycloudflare.com

INFO:     Started server process [4326]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
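Because the endpoint changes on every launch, it can be handy to pull it out of the log text programmatically. The following is a minimal sketch, assuming the log keeps the layout shown above (the marker line followed by the URL on a later line). The function name `extract_api_url` and the trimmed `sample_log` are illustrative and not part of text generation webui.

```python
import re
from typing import Optional

# Marker line printed before the API URL in the log shown above.
API_MARKER = "OpenAI-compatible API URL:"


def extract_api_url(log_text: str) -> Optional[str]:
    """Return the first https URL that appears after the marker, or None."""
    _, found, tail = log_text.partition(API_MARKER)
    if not found:
        return None
    match = re.search(r"https://\S+", tail)
    return match.group(0) if match else None


# A trimmed copy of the log above, used only to exercise the function.
sample_log = """\
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://0aada85cadf60958d7.gradio.live
2023-12-17 13:25:38 INFO:OpenAI-compatible API URL:

https://worth-past-um-municipal.trycloudflare.com

INFO:     Started server process [4326]
"""

print(extract_api_url(sample_log))
```

Searching only the text after the marker keeps the Gradio public URL, which is printed earlier, from being picked up by mistake.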

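Calling the API with curl

Run this from any terminal where curl is available; I ran it on a MacBook Pro. Call the OpenAI API-compatible endpoint as shown below, replacing {YOUR-PUBLIC-ENDPOINT} with the host name from the URL of your own run. The prompt asks how to respond to a stroke.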

curl https://{YOUR-PUBLIC-ENDPOINT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Tell me how to react to stroke:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'
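The same request can also be sent from Python. The sketch below mirrors the curl command using only the standard library. The helper names `build_completion_request` and `complete` are my own, the base URL is a placeholder to be replaced with the URL from your own run, and the code has not been tested against a live endpoint.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, max_tokens=200,
                             temperature=1, top_p=0.9, seed=10):
    """Build a POST request for /v1/completions with the same body as the curl call."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "seed": seed,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, **params):
    """Send the request and return the decoded JSON response."""
    request = build_completion_request(base_url, prompt, **params)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


# Placeholder host: replace it with the URL printed by your own run.
request = build_completion_request(
    "https://YOUR-PUBLIC-ENDPOINT",
    "Tell me how to react to stroke:\n\n1.",
)
print(request.full_url)
```

Calling `complete(base_url, prompt)` with a real endpoint should return the same kind of JSON object that curl prints.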

API call result

Here is an example of the output from the command above.

{"id":"conv-1702900718814065152","object":"text_completion","created":1702900718,"model":"TheBloke_Llama-2-7B-Chat-GGUF","choices":[{"index":0,"finish_reason":"stop","text":" Stay calm and assess the situation.\n2. Check the person's airway, breathing, and pulse.\n3. Call 911 or your local emergency number immediately.\n4. Begin CPR if the person is unconscious, not breathing, or not responsive.\n5. Provide basic first aid, such as stopping bleeding, if possible.\n6. Keep the person warm and comfortable.\n7. Transport the person to the hospital as quickly and safely as possible.\n8. Follow the instructions of medical professionals when you arrive at the hospital.\n\n","logprobs":{"top_logprobs":[{}]}}],"usage":{"prompt_tokens":13,"completion_tokens":130,"total_tokens":143}}

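Here is the response to the same question from a quantized version of meditron 7B (7 billion parameters), an open-source large language model specialized for medicine. It is reportedly based on Llama 2 and trained on a medical corpus that includes medical guidelines. Its safety apparently has yet to be validated, and the model carries a notice that it is not suitable for production use.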

curl https://skip-status-saddam-weblogs.trycloudflare.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Tell me how to react to stroke:\n\n1.", 
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'
{"id":"conv-1702806654886163968","object":"text_completion","created":1702806654,"model":"meditron-7b-chat.Q4_K_M.gguf","choices":[{"index":0,"finish_reason":"length","text":" Call 911 and activate the stroke emergency response system in your area.\n\n2. If a loved one is having a stroke, stay calm and tell them to call 911.\n\n3. If possible, help the person to sit up or lie down to the side, and protect the person's neck, to keep the airway clear.\n\n4. Check the person's pulse and breathing, and tell emergency medical services (EMS) what symptoms the person is experiencing.\n\n5. If the person is unable to communicate, start by asking the person to open their eyes, or if that doesn't work, look at the person's face and mouth. If the person is unconscious or unable to respond, gently place the person's hand on the neck to check if they have a pulse, or ask someone else who knows them to do so.\n\n6. If","logprobs":{"top_logprobs":[{}]}}],"usage":{"prompt_tokens":13,"completion_tokens":202,"total_tokens":215}}
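The two responses differ in finish_reason: "stop" for Llama 2 and "length" for meditron, where the text breaks off at "6. If". A small helper can pull out the generated text and flag that case. This is a sketch that assumes the response has the shape shown above; `summarize_completion` is an illustrative name, and the sample is an abbreviated copy of the meditron response rather than the full output.

```python
def summarize_completion(response: dict) -> dict:
    """Pick out the fields of a /v1/completions response that matter most."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "model": response.get("model"),
        "text": choice["text"],
        # "length" means generation stopped at the token limit, not naturally.
        "truncated": choice.get("finish_reason") == "length",
        "completion_tokens": usage.get("completion_tokens"),
    }


# Abbreviated copy of the meditron response above (the text field is shortened).
sample_response = {
    "object": "text_completion",
    "model": "meditron-7b-chat.Q4_K_M.gguf",
    "choices": [
        {
            "index": 0,
            "finish_reason": "length",
            "text": " Call 911 and activate the stroke emergency response system in your area.",
        }
    ],
    "usage": {"prompt_tokens": 13, "completion_tokens": 202, "total_tokens": 215},
}

summary = summarize_completion(sample_response)
print(summary["truncated"], summary["completion_tokens"])
```

When truncated is true, raising max_tokens in the request is one way to get a complete answer.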

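Notes

About usable model sizes

If you choose a model that does not fit in VRAM, startup will fail. Quantized 7B (7 billion parameter) models are often around 5 GB, so choosing one of these lowers the chance of failure. Select your model with this in mind.

For reference, the NVIDIA T4 GPU available on the free tier of Google Colab has 16 GB of VRAM.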
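As a rough pre-check, you can compare the model file size with the available VRAM before launching. The sketch below is only a rule of thumb: the 2 GB headroom for the context cache and runtime overhead is my own assumption, not a measured value, and actual memory use depends on context length and loader settings.

```python
def fits_in_vram(model_file_gb: float, vram_gb: float = 16.0,
                 headroom_gb: float = 2.0) -> bool:
    """Return True if the model file plus an assumed headroom fits in VRAM.

    headroom_gb is a guessed allowance for the context cache and runtime
    overhead; tune it for your own setup.
    """
    return model_file_gb + headroom_gb <= vram_gb


# About 5 GB: the typical size of a quantized 7B model mentioned above.
print(fits_in_vram(5.0))
# A hypothetical 40 GB model file, far beyond the T4's 16 GB.
print(fits_in_vram(40.0))
```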

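About GPU usage time

Usage is limited through a system called compute units. If you run experiments on a GPU, you will probably use up your compute units very quickly. If you plan to test a variety of models, a paid plan is the better choice.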

References

github.com/oobabooga/text-generation-webui
github.com/oobabooga/text-generation-webui/12 - OpenAI API
Llama 2 7 billion parameter model (quantized): huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
Medical 7 billion parameter model (quantized): huggingface.co/TheBloke/meditron-7B-chat-GGUF
github.com/epfLLM/meditron
