Launching Llama 3.2 on an Inf2 Instance for the First Time

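I tried out the steps introduced in this blog post.

Normally, running inference on the Inferentia chips in an Inf2 instance requires compiling the model with the Neuron Compiler beforehand. The vLLM library supports Inferentia, so no separate manual compilation step is needed.

It is said to be easy to use, so I gave it a try.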

Create an EC2 instance

For the AMI, choose "Deep Learning AMI Neuron (Ubuntu 22.04)".

For the instance type, choose "inf2.xlarge".

Environment setup

Connect with Session Manager and work from there.

First, switch to the ubuntu user.

sudo su - ubuntu

Create the Dockerfile.

cat > Dockerfile <<\EOF
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
FROM $BASE_IMAGE
RUN echo "Base image is $BASE_IMAGE"
# Install some basic utilities
RUN apt-get update && \
    apt-get install -y \
        git \
        python3 \
        python3-pip \
        ffmpeg libsm6 libxext6 libgl1
### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
ENV VLLM_TARGET_DEVICE=neuron
RUN git clone https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git checkout v0.6.2 && \
    python3 -m pip install -U \
        "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2 \
        -r requirements-neuron.txt && \
    pip install --no-build-isolation -v -e . && \
    pip install --upgrade triton==3.0.0
CMD ["/bin/bash"]
EOF

Build the Docker image.

docker build . -t vllm-neuron

Set your Hugging Face token.

Request access to the model on Hugging Face beforehand.

export HF_TOKEN="YOUR_TOKEN_HERE"

Start the Docker container.

docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        vllm-neuron
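
Before running the command above, it may help to confirm that the token is set and that the Neuron device exists, since the container depends on both. The following is a minimal sketch; the `preflight` function is a hypothetical helper and is not part of the original steps.

```shell
# Hypothetical pre-flight helper (an assumption, not from the original steps).
# $1 = token value, $2 = device path such as /dev/neuron0
preflight() {
  if [ -z "$1" ]; then
    echo "HF_TOKEN is empty" >&2
    return 1
  fi
  if [ ! -e "$2" ]; then
    echo "device $2 not found" >&2
    return 1
  fi
  echo "ok"
}

# Usage on the instance:
#   preflight "$HF_TOKEN" /dev/neuron0
```

If the function prints "ok", both prerequisites for the `docker run` command are in place.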

From here on, the work is done inside the Docker container.
Download and start the model. This walkthrough uses "meta-llama/Llama-3.2-1B".

As of 2024-11-28, the only model architectures supported on Neuron are "MistralForCausalLM" and "LlamaForCausalLM".

Inside the Docker container
vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

Downloading the model takes a while.

Once the model has finished loading, a log like the following is printed.

WARNING 11-28 14:10:35 serving_embedding.py:189] embedding_mode is False. Embedding API will not work.
INFO 11-28 14:10:35 launcher.py:19] Available routes are:
INFO 11-28 14:10:35 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 11-28 14:10:35 launcher.py:27] Route: /health, Methods: GET
INFO 11-28 14:10:35 launcher.py:27] Route: /tokenize, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /detokenize, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/models, Methods: GET
INFO 11-28 14:10:35 launcher.py:27] Route: /version, Methods: GET
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 11-28 14:10:35 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [12]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 11-28 14:10:45 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 11-28 14:10:55 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation 

Detach from the Docker container for now (press Ctrl+P, then Ctrl+Q). The container keeps running.
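
The startup log lists a `/health` route. When the steps are scripted, a small loop on the host could wait for that endpoint instead of watching the log. This is a minimal sketch; the function name, retry count, and delay are assumptions chosen for illustration.

```shell
# Hypothetical helper (not from the original post): poll a URL until it answers
# with a successful HTTP status, or give up after a fixed number of attempts.
# $1 = URL, $2 = max attempts (default 60), $3 = delay in seconds (default 10)
wait_for_server() {
  wfs_url="$1"
  wfs_attempts="${2:-60}"
  wfs_delay="${3:-10}"
  wfs_i=1
  while [ "$wfs_i" -le "$wfs_attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failure, --max-time: cap each try
    if curl -sf --max-time 5 -o /dev/null "$wfs_url"; then
      echo "ready after $wfs_i attempt(s)"
      return 0
    fi
    sleep "$wfs_delay"
    wfs_i=$((wfs_i + 1))
  done
  echo "not ready after $wfs_attempts attempt(s)" >&2
  return 1
}

# Usage on the host:
#   wait_for_server http://localhost:8000/health
```

The function returns a non-zero status when the server never becomes reachable, so a calling script can stop early.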

As the ubuntu user on the host, send a request to vLLM.

Ubuntu
curl localhost:8000/v1/models | jq .

The model information is printed.

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.2-1B",
      "object": "model",
      "created": 1732803234,
      "owned_by": "vllm",
      "root": "meta-llama/Llama-3.2-1B",
      "parent": null,
      "max_model_len": 4096,
      "permission": [
        {
          "id": "modelperm-b0c0fc2e5b084cfe905961518a1e5811",
          "object": "model_permission",
          "created": 1732803234,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

Now let's try generating text.

Ubuntu
curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq .

The following response is returned.

{
  "id": "cmpl-9815602063234a4ca04d6b4a89ed4410",
  "object": "text_completion",
  "created": 1732803360,
  "model": "meta-llama/Llama-3.2-1B",
  "choices": [
    {
      "index": 0,
      "text": " How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adaptto new situations and environments.",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 134,
    "completion_tokens": 128
  }
}
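
When the prompt contains quotes or newlines, hand-writing the JSON body becomes error-prone, and often only the generated text is of interest. The sketch below uses `jq`, which the commands above already rely on, to build the request body and to pull out `choices[0].text`. Both function names are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helpers (assumptions, not from the original post).

# Build a /v1/completions request body.
# $1 = model name, $2 = prompt, $3 = max_tokens (a number)
build_payload() {
  jq -nc --arg model "$1" --arg prompt "$2" --argjson max_tokens "$3" \
    '{model: $model, prompt: $prompt, temperature: 0, max_tokens: $max_tokens}'
}

# Read a completions response on stdin and print only the generated text.
completion_text() {
  jq -r '.choices[0].text'
}

# Usage on the host:
#   build_payload "meta-llama/Llama-3.2-1B" "What is Gen AI?" 128 |
#     curl -s localhost:8000/v1/completions \
#       -H "Content-Type: application/json" -d @- |
#     completion_text
```

Passing `-d @-` makes curl read the request body from standard input, so the payload never has to be quoted by hand.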

The curl progress output is shown below. It took about 4 seconds to receive the response with 128 generated tokens, which felt fairly fast.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1067  100   966  100   101    244     25  0:00:04  0:00:03  0:00:01   270
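
As a rough estimate, the numbers above can be turned into a generation rate: 128 completion tokens in about 4 seconds. The sketch below does that arithmetic; `tokens_per_second` is a hypothetical helper, and the result includes prompt processing and HTTP overhead, so it is only an approximation.

```shell
# Hypothetical helper: tokens generated per second.
# $1 = completion tokens, $2 = elapsed seconds (must be non-zero)
tokens_per_second() {
  awk -v tokens="$1" -v seconds="$2" 'BEGIN { printf "%.1f\n", tokens / seconds }'
}

# Values taken from the response ("completion_tokens": 128) and the curl output (about 4 s).
tokens_per_second 128 4   # prints 32.0
```

For a finer measurement than the progress meter, curl can print the elapsed time itself with `-w '%{time_total}\n'`.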
