This is a further continuation of this article.
We run the notebook 03_serve_driver_proxy.
Here, the model is served via the driver proxy. This is an easy way to serve an LLM during the development and testing stages.
Running the notebook
Download the model.
# Load model to text generation pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
# it is suggested to pin the revision commit hash and not change it for reproducibility because the uploader might change the model afterwards; you can find the commit history of llamav2-7b-chat in https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/commits/main
model = "meta-llama/Llama-2-7b-chat-hf"
revision = "0ede8dd71e923db6258295621d817ca8714516d4"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    revision=revision,
    return_full_text=False,  # don't return the prompt, only return the generated response
)
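Note that meta-llama/Llama-2-7b-chat-hf is a gated repository on Hugging Face, so the download above assumes you have already accepted the license and are authenticated with a Hugging Face access token. A minimal sketch of logging in first (the secret scope and key names are hypothetical):
from huggingface_hub import login

# Hypothetical: read a Hugging Face access token from a Databricks secret and log in
# so that the gated meta-llama/Llama-2-7b-chat-hf repository can be downloaded.
hf_token = dbutils.secrets.get(scope="my-scope", key="hf-token")
login(token=hf_token)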
A function that wraps the prompt.
# Prompt templates as follows could guide the model to follow instructions and respond to the input, and empirically it turns out to make Falcon models produce better responses
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)
# Define parameters to generate text
def gen_text_for_serving(prompt, **kwargs):
    prompt = PROMPT_FOR_GENERATION_FORMAT.format(instruction=prompt)
    # the default max length is pretty small (20), which would cut the generated output in the middle, so it's necessary to increase the threshold to the complete response
    if "max_new_tokens" not in kwargs:
        kwargs["max_new_tokens"] = 512
    # configure other text generation arguments
    kwargs.update(
        {
            "pad_token_id": tokenizer.eos_token_id,  # Hugging Face sets pad_token_id to eos_token_id by default; setting here to not see redundant message
            "eos_token_id": tokenizer.eos_token_id,
        }
    )
    return pipeline(prompt, **kwargs)[0]['generated_text']
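To make the template concrete, wrapping an instruction produces a prompt like the following (a small illustration, not part of the original notebook):
# Illustration: what the wrapped prompt passed to the pipeline looks like
print(PROMPT_FOR_GENERATION_FORMAT.format(instruction="How to master Python in 3 days?"))
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
How to master Python in 3 days?
### Response: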
Check that it works.
print(gen_text_for_serving("How to master Python in 3 days?"))
Mastering Python in 3 days is an ambitious goal, but it's possible with a structured approach and a good understanding of the language's fundamentals. Here's a suggested plan to help you achieve this:
Day 1:
1. Learn the basics of Python syntax and data types. Understand the difference between Python 2 and Python 3.
2. Get familiar with the standard library and its most commonly used modules.
3. Practice writing simple programs to get a feel for the language.
Day 2:
1. Learn about control structures (if/else statements, for loops, while loops).
2. Understand functions and how to define and call them.
3. Learn about data structures (lists, tuples, dictionaries).
4. Practice working with data structures and functions.
Day 3:
1. Learn about object-oriented programming (OOP) concepts in Python.
2. Understand classes and objects.
3. Learn about inheritance and polymorphism.
4. Practice creating and using your own classes.
In addition to the above, it's important to practice coding regularly and to work on small projects to apply what you've learned. You can find many resources online, such as tutorials, videos, and coding challenges, to help you learn Python quickly and effectively. Good luck!
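Because gen_text_for_serving forwards its keyword arguments to the transformers pipeline, generation parameters can also be passed explicitly. A minimal sketch (the prompt and parameter values here are arbitrary examples):
# generation kwargs are forwarded to pipeline(); temperature only takes effect when sampling is enabled
print(gen_text_for_serving(
    "Explain what Databricks is in one sentence.",
    do_sample=True,
    temperature=0.7,
    max_new_tokens=128,
))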
Serving with Flask
from flask import Flask, jsonify, request
app = Flask("llama-2-7b-chat")
@app.route('/', methods=['POST'])
def serve_falcon_7b_instruct():
    resp = gen_text_for_serving(**request.json)
    return jsonify(resp)
from dbruntime.databricks_repl_context import get_context
ctx = get_context()
port = "7777"
driver_proxy_api = f"https://{ctx.browserHostName}/driver-proxy-api/o/0/{ctx.clusterId}/{port}"
print(f"""
driver_proxy_api = '{driver_proxy_api}'
cluster_id = '{ctx.clusterId}'
port = {port}
""")
The driver proxy connection information is displayed.
driver_proxy_api = 'https://xxxxxx.cloud.databricks.com/driver-proxy-api/o/0/aaaaaaa/7777'
cluster_id = 'aaaaaaa'
port = 7777
Running Flask makes the LLM available through the driver proxy.
app.run(host="0.0.0.0", port=port, debug=True, use_reloader=False)
Obtain a personal access token so that you can access the endpoint from the client app.
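Before building the UI, it can be worth checking from your local machine that the endpoint responds. A minimal sketch, assuming the driver_proxy_api URL printed above and your personal access token (the request format matches the Flask route defined earlier, and the header name follows the client code below):
import requests

token = "<Databricks personal access token>"
url = "<driver_proxy_api>"

# Hypothetical quick check from a local machine; replace the URL and token with your own values.
response = requests.post(
    url,
    headers={
        "Content-Type": "application/json",
        "Authentication": f"Bearer {token}",  # same header as the client app below
    },
    json={"prompt": "What is Databricks?", "max_new_tokens": 128},
)
print(response.text)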
Building the client app
Build a chainlit app on a local machine that can access Databricks.
databricks_proxy.py
import os
from langchain import PromptTemplate, OpenAI, LLMChain
import chainlit as cl
from chainlit import on_message
import requests
import json
def request_llamav2_7b(prompt, temperature=1.0, max_new_tokens=1024):
    token = "<Databricks personal access token>"
    url = "<driver_proxy_api>"
    headers = {
        "Content-Type": "application/json",
        "Authentication": f"Bearer {token}"
    }
    data = {
        "prompt": prompt,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.text
@on_message
def main(message: str):
    response = request_llamav2_7b(message)
    # Send a response back to the user
    cl.Message(
        content=f"{response}",
    ).send()
Start the app.
chainlit run databricks_proxy.py -w
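The -w flag enables watch mode, so the app reloads automatically when databricks_proxy.py is edited; by default, the chainlit UI should be available at http://localhost:8000.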