Running Llama 3.2 on my first Inf2 instance

Posted at 2024-11-28

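I tried out the steps introduced in this blog post.

Normally, to run inference on the Inferentia chips in an Inf2 instance, you have to compile the model with the Neuron Compiler. The vLLM library supports Inferentia, so you do not need to compile the model yourself beforehand.

It is supposed to be easy to use, so I gave it a try.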

Create an EC2 instance

For the AMI, choose "Deep Learning AMI Neuron (Ubuntu 22.04)".

For the instance type, choose "inf2.xlarge".

Environment setup

Connect with Session Manager and work from there.

First, switch to the ubuntu user.

sudo su - ubuntu

Create a Dockerfile.

cat > Dockerfile <<\EOF
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
FROM $BASE_IMAGE
RUN echo "Base image is $BASE_IMAGE"
# Install some basic utilities
RUN apt-get update && \
    apt-get install -y \
        git \
        python3 \
        python3-pip \
        ffmpeg libsm6 libxext6 libgl1
### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
ENV VLLM_TARGET_DEVICE=neuron
RUN git clone https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git checkout v0.6.2 && \
    python3 -m pip install -U \
        'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
        -r requirements-neuron.txt && \
    pip install --no-build-isolation -v -e . && \
    pip install --upgrade triton==3.0.0
CMD ["/bin/bash"]
EOF

Build the Docker image.

docker build . -t vllm-neuron

Set your Hugging Face token.

Request access to the model on Hugging Face beforehand.

export HF_TOKEN="YOUR_TOKEN_HERE"

Start the Docker container.

docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        vllm-neuron

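The following steps are performed inside the Docker container.
Download the model and start the server. This time I use "meta-llama/Llama-3.2-1B".

On Neuron, two model architectures are supported: "MistralForCausalLM" and "LlamaForCausalLM" (as of 2024-11-28).

Inside the Docker container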

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

Downloading the model takes a while.

Once the model has finished loading, logs like the following are printed.

WARNING 11-28 14:10:35 serving_embedding.py:189] embedding_mode is False. Embedding API will not work.
INFO 11-28 14:10:35 launcher.py:19] Available routes are:
INFO 11-28 14:10:35 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /health, Methods: GET
INFO 11-28 14:10:35 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-28 14:10:35 launcher.py:27] Route: /version, Methods: GET
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [12]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 11-28 14:10:45 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-28 14:10:55 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation

Detach from the Docker container for now (press ctrl + p, then ctrl + q).
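
Since loading takes a while, it can be handy to wait for the server from the host instead of watching the logs. The following is a minimal sketch, assuming the /health route listed in the log above answers with an HTTP 2xx status once the server is up. The name wait_for_health and its three arguments are made up for illustration and are not part of vLLM.

```shell
# wait_for_health is a hypothetical helper (not part of vLLM).
# It polls a URL until it answers with HTTP 2xx or the attempts run out.
#   $1: URL to poll        (default: http://localhost:8000/health)
#   $2: number of attempts (default: 60)
#   $3: seconds to sleep between attempts (default: 5)
wait_for_health() {
  local url="${1:-http://localhost:8000/health}"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failure, -o /dev/null: discard the body
    if curl -sf -o /dev/null "$url"; then
      echo "server is ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server did not become ready after ${attempts} attempt(s)" >&2
  return 1
}
```

Calling wait_for_health with no arguments polls localhost:8000 every 5 seconds for up to 60 attempts. The defaults are guesses, so adjust them to how long the model takes to load in your environment.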

As the ubuntu user, send a request to vLLM.

Ubuntu

curl localhost:8000/v1/models | jq .

The model information is printed.

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.2-1B",
      "object": "model",
      "created": 1732803234,
      "owned_by": "vllm",
      "root": "meta-llama/Llama-3.2-1B",
      "parent": null,
      "max_model_len": 4096,
      "permission": [
        {
          "id": "modelperm-b0c0fc2e5b084cfe905961518a1e5811",
          "object": "model_permission",
          "created": 1732803234,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

Now, let's try generating text.

Ubuntu

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq .

{
  "id": "cmpl-9815602063234a4ca04d6b4a89ed4410",
  "object": "text_completion",
  "created": 1732803360,
  "model": "meta-llama/Llama-3.2-1B",
  "choices": [
    {
      "index": 0,
      "text": " How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adaptto new situations and environments.",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 134,
    "completion_tokens": 128
  }
}
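
The full JSON is verbose, so when you only want the generated text, jq can pull out a single field. This is a small sketch, assuming the response has the shape shown above. extract_text is a hypothetical helper name, and the sample JSON is a shortened stand-in for a real response.

```shell
# extract_text is a hypothetical helper: read a completion response on stdin
# and print only the generated text of the first choice.
extract_text() {
  # -r prints the string raw, without the surrounding JSON quotes
  jq -r '.choices[0].text'
}

# Shortened stand-in for a real response, just to show the call shape.
sample='{"choices":[{"index":0,"text":" How does it work?"}]}'
printf '%s' "$sample" | extract_text
```

Against the running server, the same helper can replace `jq .` at the end of the curl command. Adding `-s` to curl hides the progress meter if you do not need it.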

The curl progress output is shown below. The response took about 4 seconds, which felt relatively fast to me.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1067  100   966  100   101    244     25  0:00:04  0:00:03  0:00:01   270
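
To put a rough number on the speed: the response reports 128 completion tokens, and the progress meter shows about 4 seconds in total, which works out to roughly 32 tokens per second. The sketch below does that division with awk. tokens_per_sec is a hypothetical helper name, and because the progress meter only has whole-second resolution (and the time includes prompt processing), the result is a ballpark figure, not a benchmark.

```shell
# tokens_per_sec is a hypothetical helper:
# completion tokens ($1) divided by elapsed seconds ($2), one decimal place.
tokens_per_sec() {
  awk -v tokens="$1" -v seconds="$2" 'BEGIN {
    if (seconds <= 0) { print "seconds must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", tokens / seconds
  }'
}

# Values taken from the run above: 128 completion tokens in about 4 seconds.
tokens_per_sec 128 4
```

For a finer-grained time than the progress meter gives, curl's `-w '%{time_total}'` option prints the total request time in seconds with fractional digits.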

References
