OCI上でKubernetes(OKE)にLLMを展開してみよう

Last updated at 2023-12-05Posted at 2023-12-05

はじめに

この記事は、Oracle Cloud Infrastructure Advent Calender 2023 シリーズ 2 Day6の記事として書いています。

2023年はLLMなどの生成系AIで非常に盛り上がった1年になりました。
特にOpenAI社のGPTないしGPTをチャットUIから利用できるようにしたChatGPTは開発者にも多大な影響を与えたと思います。

この記事では、そんなLLMをKubernetes上に載せてみるということをやってみたいと思います。

LocalAI

皆さんはLocalAIをご存知でしょうか？

OpenAIのオープンソースalternativeとして開発されているGPU不要かつオンプレミスなどのプライベート環境でも実行可能なプラットフォームです。
GPTモデル以外にもLlamaやwhisperなどの複数のLLMをサポートしています。
これを使えばプライベートなChatGPTみたいなものも構築できてしまいます！

今回はLocalAIをKubernetes上に展開し、プライベートなLLM環境とエンドポイントを構築してみたいと思います。

OKEの準備

OKEのプロビジョニングについてはいつも通りこちらのチュートリアルを参考に行なってください。
必要に応じてOCPUやメモリを設定してください。

今回の私の環境は1OCPU/8GBの3ノード環境を利用します。

Boot Volumeの拡張

今回のユースケースではコンテナ内にモデルをダウンロードする関係上、動作するコンテナサイズがかなり大きくなります。
プロビジョニング時に以下の拡張オプションでブートボリュームを500GBに設定してください。

さらに以下のcloud-initスクリプトを指定してください。

#!/bin/bash
curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
bash /var/run/oke-init.sh
sudo dd iflag=direct if=/dev/oracleoci/oraclevda of=/dev/null count=1
echo "1" | sudo tee /sys/class/block/`readlink /dev/oracleoci/oraclevda | cut -d'/' -f 2`/device/rescan
sudo /usr/libexec/oci-growfs -y

LocalAIのデプロイ

LocalAIはHelmで簡単にインストールできるので、この記事でもHelmを使っていきます。

まずはHelmレポジトリを追加します。
手順はこちらを参考に実施します。

$ helm repo add go-skynet https://go-skynet.github.io/helm-charts/

次にHelmレポジトリをアップデートします。

$ helm repo update

values.yamlを作成します。

$ helm show values go-skynet/local-ai > values.yaml

以下のようなファイルが作成されます。

replicaCount: 1

deployment:
  image: quay.io/go-skynet/local-ai:latest
  env:
    threads: 4
    context_size: 512
  modelsPath: "/models"
  download_model:
    # To use cloud provided (eg AWS) image, provide it like: 1234356789.dkr.ecr.us-REGION-X.amazonaws.com/busybox
    image: busybox
  prompt_templates:
    # To use cloud provided (eg AWS) image, provide it like: 1234356789.dkr.ecr.us-REGION-X.amazonaws.com/busybox
    image: busybox
  pullPolicy: IfNotPresent
  imagePullSecrets: []
    # - name: secret-names

resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
  #  - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # Persistent storage for models and prompt templates.
  # PVC and HostPath are mutually exclusive. If both are enabled,
  # PVC configuration takes precedence. If neither are enabled, ephemeral
  # storage is used.
  persistence:
    pvc:
      enabled: false
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: ~

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  # If deferring to an internal only load balancer
  # externalTrafficPolicy: Local
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nodeSelector: {}

tolerations: []

affinity: {}

image:
  pullPolicy: IfNotPresent

今回はデフォルトのggml-gpt4all-jというモデルを使いたいと思います。
GPT4ALLはA free-to-use, locally running, privacy-aware chatbot. No GPU or internet required.を謳っているプライベートで利用できるプラットフォームです。
こちらも複数LLMをサポートしますが、モデルによってはAPIキーが必要な場合もあります。

今回利用するggml-gpt4all-jはnomic-ai社が提供している大規模なカリキュラムベースのアシスタント対話データセットを含む、Apache License 2.0のチャットボットです。

values.yamlの50行目付近にあるコメントアウトを外します。

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
    - url:
        "https://gpt4all.io/models/ggml-gpt4all-j.bin"
        # basicAuth: base64EncodedCredentials

インストールします。

$ helm install local-ai go-skynet/local-ai -f values.yaml

モデルのダウンロード含めて起動までにかなり時間がかかると思うので、しばらく待機します。

$ kubectl get pods -w
NAME                       READY   STATUS     RESTARTS   AGE
local-ai-99c54b896-g9s4j   0/1     Init:0/1   0          59s
local-ai-99c54b896-bwrqd   0/1     PodInitializing   0          2m6s
local-ai-99c54b896-bwrqd   1/1     Running           0          8m12s

このモデルは以下のエンドポイントでアクセス可能です。

http://local-ai/v1/models

動作確認

実際にクラスタ内のテスト用コンテナからcurlでアクセスしてみましょう。

kubectl run test-access --restart=Never --image=nginx --rm -it -- curl http://local-ai/v1/models

すると以下のようなレスポンスが返ってきます。

{"object":"list","data":[{"id":"ggml-gpt4all-j","object":"model"}]}

これで無事にLLMがKubernetes上に展開されました！

意外に簡単にデプロイできましたね！！

次回は、今回デプロイしたLLMとLangChainを利用してアプリケーションを実装してみようと思います。

参考

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up