Troubleshooting an EKS Cluster with K8sGPT and Bedrock!

Posted at 2024-09-22

Hello!
This time I tried troubleshooting an EKS cluster with K8sGPT and Bedrock!

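What is K8sGPT?

K8sGPT is a tool that uses generative AI to explain, in easy-to-understand language, the problems occurring in a Kubernetes cluster.

K8sGPT was accepted as a CNCF Sandbox project on December 19, 2023.
The CNCF classifies projects into three maturity levels (Sandbox, Incubating, and Graduated), and Sandbox is the first of these.

K8sGPT works by connecting to a generative AI service as its provider.
As of September 2024, the following generative AI services are supported: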

OpenAI
Cohere
Amazon Bedrock
Amazon SageMaker
Azure OpenAI
Google Gemini
Google Vertex AI
Hugging Face
IBM watsonx.ai
LocalAI
Ollama
FakeAI

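Since I'm assuming an AWS environment, I chose Amazon Bedrock as the provider and tested whether K8sGPT can troubleshoot an EKS cluster.

Let's get started!

1. Install K8sGPT

First, install K8sGPT by following the installation steps in the documentation.

Since I'm working in AWS CloudShell, I'll install the 64-bit (amd64) RPM package.

Commands: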

curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.24/k8sgpt_amd64.rpm
sudo rpm -ivh -i k8sgpt_amd64.rpm

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ curl -LO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.24/k8sgpt_amd64.rpm
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 21.4M  100 21.4M    0     0  83.3M      0 --:--:-- --:--:-- --:--:-- 83.3M
[cloudshell-user@ip-10-134-61-137 ~]$ sudo rpm -ivh -i k8sgpt_amd64.rpm
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:k8sgpt-0:0.3.24-1                ################################# [100%]
[cloudshell-user@ip-10-134-61-137 ~]$

The installation seems to have gone through without a hitch.
Let's run a command to confirm that it was installed correctly.

Command:

k8sgpt version

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ k8sgpt version
k8sgpt: 0.3.24 (eac9f07), built at: unknown

The version is reported, so the installation looks good!
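The RPM command in this section hardcodes the amd64 package, which matches CloudShell. If you install on an ARM64 machine instead, a small helper can pick the package from `uname -m`. This is a minimal sketch that assumes the release also publishes `k8sgpt_arm64.rpm` under the same naming pattern as the URL above, so check the release page for your version first; the function names are hypothetical.

```shell
# Hypothetical helper: map a machine architecture (default: `uname -m`)
# to a k8sgpt RPM asset name, assuming the k8sgpt_<arch>.rpm pattern.
k8sgpt_rpm_asset() {
  case "${1:-$(uname -m)}" in
    x86_64)        echo "k8sgpt_amd64.rpm" ;;
    aarch64|arm64) echo "k8sgpt_arm64.rpm" ;;
    *)             echo "unsupported architecture: ${1:-$(uname -m)}" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the download URL for a release version
# (v0.3.24 is the version used in this article) and optional architecture.
k8sgpt_rpm_url() {
  local version="${1:-v0.3.24}" asset
  asset="$(k8sgpt_rpm_asset "${2:-}")" || return 1
  echo "https://github.com/k8sgpt-ai/k8sgpt/releases/download/${version}/${asset}"
}

# Usage on the target machine:
#   curl -LO "$(k8sgpt_rpm_url v0.3.24)"
#   sudo rpm -ivh "$(k8sgpt_rpm_asset)"
```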

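2. Register Bedrock as a K8sGPT provider

Next, register Bedrock as a K8sGPT provider.
Before you can use a Bedrock foundation model, you need to request access to that model.
If you haven't requested access yet, follow the AWS documentation on Bedrock model access.

This time I'll use Claude 3.5 Sonnet, Anthropic's latest model at the time of writing, as the foundation model.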

First, set the following AWS values as environment variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
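A minimal sketch of setting these in CloudShell with the standard AWS SDK variable names, plus a hypothetical helper `require_aws_env` that reports any missing variable before you run `k8sgpt auth add`. The key values are placeholders; `us-east-1` is assumed here only because it matches the region used later in this article.

```shell
# Placeholders: replace with your own IAM access key and the Region where
# you were granted access to Claude 3.5 Sonnet.
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"

# Hypothetical helper: print each required variable that is unset or empty
# to stderr and return non-zero if any is missing.
require_aws_env() {
  local missing=0
  [ -n "${AWS_ACCESS_KEY_ID:-}" ] || { echo "missing: AWS_ACCESS_KEY_ID" >&2; missing=1; }
  [ -n "${AWS_SECRET_ACCESS_KEY:-}" ] || { echo "missing: AWS_SECRET_ACCESS_KEY" >&2; missing=1; }
  [ -n "${AWS_DEFAULT_REGION:-}" ] || { echo "missing: AWS_DEFAULT_REGION" >&2; missing=1; }
  return "$missing"
}

# Usage:
#   require_aws_env && k8sgpt auth add --backend amazonbedrock --model anthropic.claude-3-5-sonnet-20240620-v1:0
```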

Next, register the provider with the k8sgpt auth add command.
Specify amazonbedrock with the --backend option
and the foundation model's model ID with the --model option.

Command:

k8sgpt auth add --backend amazonbedrock --model anthropic.claude-3-5-sonnet-20240620-v1:0

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ k8sgpt auth add --backend amazonbedrock --model anthropic.claude-3-5-sonnet-20240620-v1:0
amazonbedrock added to the AI backend provider list

A message came back indicating that the provider was registered.
Let's also confirm the registration with the k8sgpt auth list command.

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ k8sgpt auth list
Default: 
> openai
Active: 
> amazonbedrock
Unused: 
> openai
> localai
> azureopenai
> noopai
> cohere
> amazonsagemaker

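amazonbedrock is listed under Active, so it looks good! Note that Default is still openai, which is why the analyze command later passes --backend amazonbedrock explicitly.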
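If you want to check this in a script rather than by eye, the `k8sgpt auth list` output can be parsed. This is a minimal sketch that assumes the "Heading:" / "> name" layout shown above, which may differ between versions; the function name is hypothetical.

```shell
# Hypothetical helper: read `k8sgpt auth list` output on stdin and succeed
# only if the given backend appears under the "Active:" heading.
backend_is_active() {
  awk -v want="$1" '
    /^Active:/    { section = "active"; next }   # entering the Active block
    /^[A-Za-z]+:/ { section = ""; next }         # any other heading ends it
    section == "active" && $1 == ">" && $2 == want { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
#   k8sgpt auth list | backend_is_active amazonbedrock && echo "amazonbedrock is active"
```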

3. Troubleshoot with K8sGPT!

With setup complete, let's finally try some troubleshooting!
This time I'll create a Pod with an invalid image name.

Command:

kubectl run problem-pod --image hogehoge

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ kubectl run problem-pod --image hogehoge 
pod/problem-pod created
[cloudshell-user@ip-10-134-61-137 ~]$ kubectl get pod 
NAME          READY   STATUS         RESTARTS   AGE
problem-pod   0/1     ErrImagePull   0          7s

Because the image name is a dummy value (hogehoge), the image cannot be pulled, as the ErrImagePull value in the STATUS field shows.
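To check this status in a script, you can pull the STATUS column for one Pod out of the `kubectl get pod` table. A minimal sketch (function name hypothetical) that assumes the default column layout shown above; the trailing comment shows a JSONPath alternative that does not depend on the table layout. After a few failed pulls, the status typically changes from ErrImagePull to ImagePullBackOff as the kubelet backs off.

```shell
# Hypothetical helper: print the STATUS column for the named Pod from the
# `kubectl get pod` table on stdin (NAME READY STATUS RESTARTS AGE).
pod_status() {
  awk -v pod="$1" 'NR > 1 && $1 == pod { print $3 }'
}

# Usage:
#   kubectl get pod | pod_status problem-pod
#
# Layout-independent alternative that reads the container's waiting reason:
#   kubectl get pod problem-pod \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
```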

Let's see whether K8sGPT can detect this state...

Command:

k8sgpt analyze --explain --backend amazonbedrock

The k8sgpt analyze command analyzes the cluster for problems.
With the --explain option, it also explains each detected problem and how to resolve it.
For the --backend option, specify amazonbedrock, the provider registered earlier.
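On a larger cluster you may want to narrow the analysis. A minimal sketch of a wrapper that limits it to Pods in one namespace; the `--filter` and `--namespace` flags are taken from `k8sgpt analyze --help`, so confirm they exist in your version, and the function name is hypothetical.

```shell
# Hypothetical wrapper: same analysis as above, limited to the Pod analyzer
# in a single namespace (default: "default").
analyze_pods() {
  local namespace="${1:-default}"
  k8sgpt analyze --explain --backend amazonbedrock --filter Pod --namespace "$namespace"
}

# Usage:
#   analyze_pods default
```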

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ k8sgpt analyze --explain --backend amazonbedrock 
 100% |████████████████████████████████████████████████████████████████████████████████████| (1/1, 6 it/min)        
AI Provider: amazonbedrock

0 default/problem-pod(problem-pod)
- Error: failed to pull and unpack image "docker.io/library/hogehoge:latest": failed to resolve reference "docker.io/library/hogehoge:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
 It looks like you are trying to pull an image called "hogehoge" from Docker Hub, but are getting an error that this repository does not exist or you don't have access to it. 

The key parts of the error are:

- "failed to resolve reference "docker.io/library/hogehoge:latest"" - This means Docker couldn't find the image name you specified.

- "pull access denied, repository does not exist or may require authorization" - This indicates the image doesn't exist or you need to be authorized to access it. 

- "insufficient_scope: authorization failed" - You don't have permission to access the image.

Some things to try:

- Double check the image name is correct. "hogehoge" is likely not a real image.

- If the image is private, make sure you are logged in to Docker Hub with `docker login` and have access to the repository.

- If the image exists in a different registry, you need to specify the full registry URL, not just "docker.io".

- Try pulling a public image like "docker.io/library/alpine" to test your Docker setup.

- Check your internet connection and Docker daemon configuration. A network or firewall issue could also cause this.

So in summary, the error means the image name you specified couldn't be found or you don't have access to it. Verify the image name, repository access, network connection, and Docker daemon configuration.

Oh, the Pod we created has been detected!
It also returns a range of advice for resolving the problem.

Of that advice, "Double check the image name is correct. "hogehoge" is likely not a real image" is the right answer.

4. Run K8sGPT as an Operator

K8sGPT can also run inside a Kubernetes cluster as an Operator.
Running it as an Operator lets it analyze problems in the cluster automatically.

I'll install the Operator with Helm.

Commands:

helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
helm install release k8sgpt/k8sgpt-operator -n k8sgpt --create-namespace

These commands add the K8sGPT Helm repository, update the repositories, and finally install k8sgpt-operator from the repository.
The -n k8sgpt --create-namespace options create the k8sgpt Namespace and install the Operator into it.

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ helm repo add k8sgpt https://charts.k8sgpt.ai/
"k8sgpt" has been added to your repositories
[cloudshell-user@ip-10-134-61-137 ~]$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "k8sgpt" chart repository
Update Complete. ⎈Happy Helming!⎈
[cloudshell-user@ip-10-134-61-137 ~]$ 
[cloudshell-user@ip-10-134-61-137 ~]$ helm install release k8sgpt/k8sgpt-operator -n k8sgpt --create-namespace
NAME: release
LAST DEPLOYED: Sun Sep 22 15:32:39 2024
NAMESPACE: k8sgpt
STATUS: deployed
REVISION: 1
TEST SUITE: None
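Helm reports the release as deployed, but the Operator's controller Pod may still be starting. A minimal sketch that waits until every Deployment in the k8sgpt namespace reports Available before you create a K8sGPT resource; the function name and the 120-second default timeout are my own assumptions.

```shell
# Hypothetical helper: block until all Deployments in the namespace are
# Available, or fail after the timeout.
wait_for_k8sgpt_operator() {
  local namespace="${1:-k8sgpt}" timeout="${2:-120s}"
  kubectl wait deployment --all --for=condition=Available \
    --namespace "$namespace" --timeout "$timeout"
}

# Usage:
#   wait_for_k8sgpt_operator && kubectl get pods -n k8sgpt
```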

Checking the API resources added by the Operator:

[cloudshell-user@ip-10-134-61-137 ~]$ kubectl api-resources  | grep -i gpt
k8sgpts                                          core.k8sgpt.ai/v1alpha1           true         K8sGPT
results                                          core.k8sgpt.ai/v1alpha1           true         Result

The K8sGPT Operator's CRDs have been added!
The K8sGPT resource configures K8sGPT, and the Result resource holds the troubleshooting results.
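To see which fields the K8sGPT resource accepts before writing a manifest, `kubectl explain` can read the CRD's schema, assuming the CRD publishes an OpenAPI schema (operator CRDs usually do). A minimal sketch; the function name is hypothetical.

```shell
# Hypothetical helper: show the documented schema of a K8sGPT field path
# (default: spec.ai) from the installed CRD.
explain_k8sgpt_field() {
  kubectl explain "k8sgpts.${1:-spec.ai}"
}

# Usage:
#   explain_k8sgpt_field spec.ai.secret
```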

Let's deploy a K8sGPT resource right away to configure K8sGPT.

Commands:

kubectl apply -n k8sgpt -f - << EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-amazonbedrock
spec:
  ai:
    enabled: true
    model: anthropic.claude-3-5-sonnet-20240620-v1:0
    backend: amazonbedrock
    secret:
      name: bedrock-sample-secret
    region: us-east-1
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.24
EOF
kubectl get k8sgpts -n k8sgpt

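Set spec.ai.model to the same foundation model ID registered earlier, and spec.ai.backend to amazonbedrock, the same provider registered earlier.

Set spec.ai.secret.name to the Secret object holding the IAM access key used to access Bedrock. The Secret lives in the same k8sgpt namespace as the K8sGPT resource; I created it as follows.

Command: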

kubectl create secret generic bedrock-sample-secret --from-literal=AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" --from-literal=AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" -n k8sgpt

From a security standpoint, it is better to use EKS Pod Identity or IAM Roles for Service Accounts (IRSA) so that IAM access keys are not stored in the cluster, but since this is only a test, I took this simpler approach.
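The `kubectl create secret` command fails with an AlreadyExists error if you run it a second time, for example after rotating the key. A common pattern is to render the Secret with `--dry-run=client -o yaml` and pipe it into `kubectl apply`, which creates or updates it. A minimal sketch; the function name is hypothetical, and the Secret name and namespace default to the ones used in this article.

```shell
# Hypothetical helper: create or update the Secret referenced by
# spec.ai.secret.name from the current AWS_* environment variables.
apply_bedrock_secret() {
  local name="${1:-bedrock-sample-secret}" namespace="${2:-k8sgpt}"
  kubectl create secret generic "$name" \
    --from-literal=AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
    --namespace "$namespace" --dry-run=client -o yaml \
    | kubectl apply -f -   # apply creates the Secret or updates it in place
}

# Usage:
#   apply_bedrock_secret
```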

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ kubectl apply -n k8sgpt -f - << EOF
> apiVersion: core.k8sgpt.ai/v1alpha1
> kind: K8sGPT
> metadata:
>   name: k8sgpt-amazonbedrock
> spec:
>   ai:
>     enabled: true
>     model: anthropic.claude-3-5-sonnet-20240620-v1:0
>     backend: amazonbedrock  
>     secret:
>      name: bedrock-sample-secret
>     region: us-east-1
>   noCache: false
>   repository: ghcr.io/k8sgpt-ai/k8sgpt
>   version: v0.3.24
> EOF
k8sgpt.core.k8sgpt.ai/k8sgpt-amazonbedrock created
[cloudshell-user@ip-10-134-61-137 ~]$ kubectl get k8sgpts -n k8sgpt
NAME                   AGE
k8sgpt-amazonbedrock   17s

The resource was created!

The analysis results can be checked in Result objects.

Command:

kubectl get result -n k8sgpt -o jsonpath='{.items[].spec}' | jq .

Example run:

[cloudshell-user@ip-10-134-61-137 ~]$ kubectl get result -n k8sgpt -o jsonpath='{.items[].spec}' | jq .
{
  "backend": "amazonbedrock",
  "details": " Unfortunately I do not have enough context to fully understand your command. However, it seems you are asking me to stop or cancel pulling a Docker image called \"hogehoge\". As an AI assistant I do not have direct access to Docker or the ability to control image pulls. I suggest checking the Docker documentation on how to cancel an image pull if that is what you intended.",
  "error": [
    {
      "text": "Back-off pulling image \"hogehoge\""
    }
  ],
  "kind": "Pod",
  "name": "default/problem-pod",
  "parentObject": ""
}

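The problem Pod created earlier has been detected, good! The error text is now "Back-off pulling image" rather than ErrImagePull because, after repeated pull failures, the Pod has moved into the ImagePullBackOff state. Note, however, that the details text is off the mark this time: instead of explaining the error, the model read it as a request to cancel an image pull, so the quality of the generated explanation is not guaranteed.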
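The command above prints the spec of a single Result. Once the Operator has found several problems, printing one line per Result may be more convenient. A minimal sketch using kubectl's JSONPath `range` syntax, so jq is not needed; the function name is hypothetical, and the field paths `spec.name` and `spec.error[0].text` are taken from the Result output shown above.

```shell
# Hypothetical helper: print "<namespace/pod><TAB><first error text>" for
# every Result object in the namespace (default: k8sgpt).
list_k8sgpt_results() {
  kubectl get results --namespace "${1:-k8sgpt}" \
    -o jsonpath='{range .items[*]}{.spec.name}{"\t"}{.spec.error[0].text}{"\n"}{end}'
}

# Usage:
#   list_k8sgpt_results
```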

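Summary

In this post, I tried K8sGPT, an OSS tool that uses generative AI to troubleshoot Kubernetes clusters, covering both setup and a sample troubleshooting run.
Kubernetes troubleshooting tends to get complicated and can be hard without a fair amount of experience, but I think K8sGPT could help even beginners troubleshoot efficiently.
It was quite easy to get running, so it seems practical to adopt in real-world operations of systems running on Kubernetes clusters.

See you next time!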
