Trying Out Retina, a Network Observability Platform for Kubernetes (Part 3: Mastering Captures)

Last updated at 2024-10-30, posted at 2024-10-30

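Introduction

I'm Nozawa from Mitsubishi Electric.

This article is the third and final part of a series on Retina, a network observability platform for Kubernetes clusters.
It covers what I found when I tried out each of Retina's capture features hands-on.
It also covers a few things I struggled with because of our corporate proxy.

Part 1 covers my research into what Retina is and how I set it up on an on-premises Kubernetes cluster, and Part 2 explains Retina's metrics, so please read those as well.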

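What this article covers

Retina's capture feature
My fight with the corporate proxy

What this article does not cover

Kubernetes fundamentals
How to use the kubectl and helm commands

What is Retina? (Recap)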

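Retina is a network observability tool for Kubernetes clusters that Microsoft released as open source on March 19, 2024.

At the time of writing (October 30, 2024), the latest version is v0.0.16.
While I was taking my time writing this final part, the documentation was expanded and the logo was redesigned.

As in Parts 1 and 2, this article generally deals with v0.0.12, but information taken from the source code and the official documentation may refer to v0.0.16.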

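Environment

This article is based on behavior observed in the following environment.

OS: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Linux kernel: v6.5.0
Kubernetes: v1.29.5
CNI: Calico v3.25.0
Retina: v0.0.12

This matters for the second half of the article: I work in a network environment where Internet access has to go through an explicit proxy.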

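Overview of the capture feature

Alongside network metrics, the capture feature is one of Retina's main features.
It captures network packets on demand for the nodes or Pods you specify.
Besides the packets themselves, it also collects metadata such as network interface information.
Being able to do all of this with a single command is convenient.

The capture results are stored together at the output location specified when the capture is run.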

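How captures work

When you run a capture with Retina, Kubernetes Jobs are created.
Because a Job remains as a resource after it finishes, it is easy to check the capture logs afterwards, which is an advantage.
Incidentally, even if you delete Retina's own workloads such as retina-agent and retina-operator, the Jobs remain unless you delete them explicitly.
It is reassuring that the capture Jobs and their logs survive even if the workloads are temporarily removed, for example during a redeployment.

So what program actually runs when a capture is executed?
The Job log gives the answer right away: it is not a Retina-specific program.
It is in fact tcpdump, which is widely used on Linux.

Capture job log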

ts=2024-09-09T09:47:56.802Z level=info caller=captureworkload/main.go:34 msg="Start to capture network traffic"
ts=2024-09-09T09:47:56.802Z level=info caller=captureworkload/main.go:35 msg="Version: " version=v0.0.12
ts=2024-09-09T09:47:56.841Z level=info caller=provider/network_capture_unix.go:46 msg="Created temporary folder for network capture" capture temporary folder=/tmp/capture-test-kredit-20240909094756UTC
ts=2024-09-09T09:47:56.897Z level=info caller=provider/network_capture_unix.go:95 msg="Running tcpdump with args" tcpdump command="/usr/bin/tcpdump -w /tmp/capture-test-kredit-20240909094756UTC/capture-test-kredit-20240909094756UTC.pcap --relinquish-privileges=root" tcpdump args="tcpdump,-w,/tmp/capture-test-kredit-20240909094756UTC/capture-test-kredit-20240909094756UTC.pcap,--relinquish-privileges=root"

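In other words, when you want to run tcpdump inside a Kubernetes cluster, Retina takes care of the name, storage location, parameters, and other settings for you.
The capture jobs you have run can be listed afterwards, which makes them easier to manage.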
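To make the naming in the log concrete, here is a minimal Go sketch that rebuilds the same tcpdump command line from a capture name, a node host name, and a timestamp. It is not Retina's actual code: the function names and the choice to pass the timestamp as a pre-formatted string are my own assumptions, inferred only from the log output above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// captureBaseName mirrors the naming seen in the job log:
// <capture name>-<node host name>-<timestamp>UTC
// stamp is assumed to be a UTC time already formatted as YYYYMMDDhhmmss.
func captureBaseName(captureName, nodeHostName, stamp string) string {
	return fmt.Sprintf("%s-%s-%sUTC", captureName, nodeHostName, stamp)
}

// buildTcpdumpArgs assembles the tcpdump arguments observed in the log.
// rawFilter is an optional tcpdump filter such as "udp port 53".
func buildTcpdumpArgs(captureName, nodeHostName, stamp, rawFilter string) []string {
	base := captureBaseName(captureName, nodeHostName, stamp)
	pcap := path.Join("/tmp", base, base+".pcap")
	args := []string{"-w", pcap, "--relinquish-privileges=root"}
	// The filter is appended as separate words; an empty filter adds nothing.
	return append(args, strings.Fields(rawFilter)...)
}

func main() {
	// Values taken from the job log above: capture "capture-test" on node "kredit".
	args := buildTcpdumpArgs("capture-test", "kredit", "20240909094756", "")
	fmt.Println("/usr/bin/tcpdump " + strings.Join(args, " "))
}
```

Run as is, this prints the same tcpdump command line that appears in the job log. Passing a filter such as "udp port 53" as the last argument appends those words after --relinquish-privileges=root.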

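Incidentally, on Windows, captures appear to be performed by running the netsh trace command.

The official documentation provides an architecture diagram of the flow up to the point where a capture runs.
It should make the relationship between Retina and kube-apiserver clear.

Image source: https://retina.sh/docs/Captures/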

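Running a capture

There are two ways to tell Retina to run a capture.

Using the CLI
Using a Kubernetes custom resource definition (CRD)

Part 1 already introduced some of this, so this is partly a repeat.
This article looks at it in a little more detail.

Using the CLI

I assume that the Retina CLI has already been installed with krew during environment setup.
Note that the kubectl-retina command can be read as the kubectl retina command.

Captures use the capture command.
It has the following four subcommands.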

create
delete
download
list

For reference, here is the help message (the flags section is omitted).

Help message

# kubectl-retina capture --help
capture network traffic

Usage:
   capture [command]

Available Commands:
  create      create a Retina Capture
  delete      Delete a Retina capture
  download    Download Retina Captures
  list        List Retina Captures

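Creating a capture (`create`)

This is the main command of the feature: it creates a new capture.
Options specify the capture name, the capture target, and where to write the results.
Once the capture is created, Jobs that perform the capture are created.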

kubectl-retina capture create --name capture-test --host-path /mnt/path/to/save

The help message includes plenty of usage examples, which should be a useful reference.

Help message

# kubectl-retina capture create --help
create a Retina Capture

Usage:
   capture create [flags]

Examples:
  # Capture network packets on the node selected by node names and copy the artifacts to the node host path /mnt/capture
  kubectl retina capture create --host-path /mnt/capture --namespace capture --node-names "aks-nodepool1-41844487-vmss000000,aks-nodepool1-41844487-vmss000001"

  # Capture network packets on the coredns pods determined by pod-selectors and namespace-selectors
  kubectl retina capture create --host-path /mnt/capture --namespace capture --pod-selectors="k8s-app=kube-dns" --namespace-selectors="kubernetes.io/metadata.name=kube-system"

  # Capture network packets on nodes with label "agentpool=agentpool" and "version:v20"
  kubectl retina capture create --host-path /mnt/capture --node-selectors="agentpool=agentpool,version:v20"

  # Capture network packets on nodes using node-selector with duration 10s
  kubectl retina capture create --host-path=/mnt/capture --node-selectors="agentpool=agentpool" --duration=10s

  # Capture network packets on nodes using node-selector and upload the artifacts to blob storage with SAS URL https://testaccount.blob.core.windows.net/<token>
  kubectl retina capture create --node-selectors="agentpool=agentpool" --blob-upload=https://testaccount.blob.core.windows.net/<token>

  # Capture network packets on nodes using node-selector and upload the artifacts to AWS S3
  kubectl retina capture create --node-selectors="agentpool=agentpool" \
  --s3-bucket "your-bucket-name" \
  --s3-region "eu-central-1"\
  --s3-access-key-id "your-access-key-id" \
  --s3-secret-access-key "your-secret-access-key"

  # Capture network packets on nodes using node-selector and upload the artifacts to S3-compatible service (like MinIO)
  kubectl retina capture create --node-selectors="agentpool=agentpool" \
  --s3-bucket "your-bucket-name" \
  --s3-endpoint "https://play.min.io:9000" \
  --s3-access-key-id "your-access-key-id" \
  --s3-secret-access-key "your-secret-access-key"

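Here are the options you are likely to use most often.

Option	Description
`name`	Capture name.
`duration`	Capture duration. Defaults to 1 minute.
`node-names`	Names of the nodes to capture on (comma-separated list allowed).
`node-selectors`	Node label selectors for the nodes to capture on (comma-separated list allowed).
`pod-selectors`	Pod label selectors for the Pods to capture on (comma-separated list allowed).
`tcpdump-filter`	Filter applied during the capture (`tcpdump` syntax).
`host-path`	Path to save to when hostPath is the output location.
`pvc`	PVC name when a persistent volume (PV) is the output location.
`s3-bucket`	S3 bucket name when S3 is the output location.
`s3-region`	S3 region name when S3 is the output location.
`s3-access-key-id`	Access key ID for S3 when S3 is the output location.
`s3-secret-access-key`	Secret access key for S3 when S3 is the output location.
`blob-upload`	Blob SAS URL when an Azure storage account is the output location.

The other options are also described in detail in the documentation, so please read it.
(It was updated considerably while I was taking my time writing this article.)

The second half of this article examines the behavior for each output location.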

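Deleting a capture (`delete`)

This command deletes a capture you created.
Specify the name of the capture to delete with an option.
Deletion is per capture, not per capture job.
The corresponding Kubernetes Jobs are deleted as well.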

# kubectl-retina capture delete --name capture-test-for-qiita-to-delete
ts=2024-10-22T20:14:30.879+0900 level=info caller=capture/delete.go:79 msg="Retina Capture \"capture-test-for-qiita-to-delete\" delete"

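Downloading a capture (`download`)

This command appears to download the results of previously run captures stored in object storage to the local machine.
I say "appears to" because the official documentation does not cover it yet, and the only information available is the command's help message and the source code.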

# kubectl-retina capture download --help
Download Retina Captures

Usage:
   capture download [flags]
(rest omitted)

As far as I can tell from the source code, the only download source currently supported is an Azure storage account.

When I tried running it, it complained as follows.

# kubectl-retina capture download --name capture-test
Error: BLOB_URL must be set/exported

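Apparently the source URL has to be set in BLOB_URL.
As for how to set that value, setting BLOB_URL as an environment variable produced the same message.
I also tried setting BLOB_URL with the kubectl-retina config set command, which I have not covered so far, with the same result.

The Go library used here, viper, can apparently read values from config files, environment variables, command-line flags, and so on.
This time, however, none of these worked.

The download logic itself seems to be implemented, but the part that passes in the setting does not seem to work properly.
It also bothers me a little that the value is passed differently from the other commands.
This command may still be unfinished.
If you have managed to get it working, please let me know in the comments.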

Listing captures (`list`)

This command lists the captures you created.
Even though it is a list command, the current version for some reason requires the name option.

# kubectl-retina capture list --name ""
NAMESPACE   CAPTURE NAME      JOBS                                                                COMPLETIONS   AGE
default     retina-test-pvc   retina-test-pvc-98wn8,retina-test-pvc-f2h98,retina-test-pvc-wm6jn   3/3           104d
default     es-capture        es-capture-8ltj4,es-capture-ftfn7,es-capture-sfrmq                  3/3           132d
default     retina-test       retina-test-h25gd,retina-test-vk99z,retina-test-xngx2               3/3           104d

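Captures created with the create command at different times are still shown as a single capture in the list if they share the same capture name.

Using a CRD

Instead of running CLI commands one by one, you can also describe the capture settings in a YAML file or similar and deploy them as a custom resource of the Capture CRD.
Because the capture settings can be managed as a file, this may make version control easier.

This article only introduces it briefly.
The following is quoted from the official documentation and shows what the definition looks like.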

crd.yaml

apiVersion: retina.sh/v1alpha1
kind: Capture
metadata:
  name: example-capture
spec:
  captureConfiguration:
    captureOption:
      duration: "30s"
      maxCaptureSize: 100
      packetSize: 1500
    captureTarget:
      namespaceSelector:
        matchLabels:
          app: target-app
  outputConfiguration:
    hostPath: /captures
    blobUpload: blob-sas-url
    s3Upload:
      bucket: retina-bucket
      region: ap-northeast-2
      path: retina/captures
      secretName: capture-s3-upload-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: capture-s3-upload-secret
data:
  s3-access-key-id: <based-encode-s3-access-key-id>
  s3-secret-access-key: <based-encode-s3-secret-access-key>

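Source: https://retina.sh/docs/Concepts/CRDs/Capture

Types of output location

Let's look at the types of output location.
Retina broadly supports the following three.

Kubernetes hostPath
Kubernetes persistent volume (PV)
Object storage

I will try them in order.
For object storage I use Amazon S3 as the example, but Azure Storage and S3-compatible services (such as MinIO) are also supported.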

hostPath

First, let's try hostPath.
Since it is a hostPath, the capture results are saved at the specified path on the node where the capture ran.

To create a capture that uses hostPath, specify the following option.

host-path

This time I captured DNS queries on the nodes running Ubuntu.
Three nodes match as capture targets.

# kubectl-retina capture create --name capture-test-for-qiita --host-path /mnt/capture --node-selectors "kubernetes.io/os-dist=ubuntu" --tcpdump-filter "udp port 53"
ts=2024-10-22T16:18:05.893+0900 level=info caller=capture/create.go:243 msg="The capture duration is set to 1m0s"
ts=2024-10-22T16:18:05.894+0900 level=info caller=capture/create.go:289 msg="The capture file max size is set to 100MB"
ts=2024-10-22T16:18:05.894+0900 level=info caller=utils/capture_image.go:56 msg="Using capture workload image ghcr.io/microsoft/retina/retina-agent:v0.0.12 with version determined by CLI version"
ts=2024-10-22T16:18:05.896+0900 level=info caller=capture/crd_to_job.go:201 msg="HostPath is not empty" HostPath=/mnt/capture
ts=2024-10-22T16:18:05.991+0900 level=info caller=capture/crd_to_job.go:876 msg="The Parsed tcpdump filter is \"\""
ts=2024-10-22T16:18:06.021+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-n8fps
ts=2024-10-22T16:18:06.039+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-22gd2
ts=2024-10-22T16:18:06.050+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-pv77z
ts=2024-10-22T16:18:06.051+0900 level=info caller=capture/create.go:125 msg="Please manually delete all capture jobs"
NAMESPACE   CAPTURE NAME             JOBS                                                                                     COMPLETIONS   AGE
default     capture-test-for-qiita   capture-test-for-qiita-22gd2,capture-test-for-qiita-n8fps,capture-test-for-qiita-pv77z   0/3           0s

A few minutes have passed, so let's look at the specified path.
An archive file bundling the results should be there.

$ ls /mnt/capture -l
(omitted)
-rw-r--r-- 1 root root    80814 10月 22 16:20 capture-test-for-qiita-kredit-20241022071847UTC.tar.gz

With hostPath, results are saved on each node where the capture ran.
To check another node's results, you need to access that node.

Let's look inside the tar file.

$ tar ztfv /mnt/capture/capture-test-for-qiita-kredit-20241022071847UTC.tar.gz
drwxr-x--- root/root         0 2024-10-22 16:19 .
-rw-r--r-- root/root     51088 2024-10-22 16:19 capture-test-for-qiita-kredit-20241022071847UTC.pcap
-rw-r--r-- root/root      9190 2024-10-22 16:19 ip-resources.txt
-rw-r--r-- root/root      4660 2024-10-22 16:19 iptables-rules.txt
dr-xr-xr-x root/root         0 2024-10-22 16:19 proc-net
-r--r--r-- root/root         0 2024-10-22 16:19 proc-net/anycast6
-r--r--r-- root/root      1558 2024-10-22 16:19 proc-net/arp
(omitted)
dr-xr-xr-x root/root         0 2024-10-22 16:19 proc-sys-net
dr-xr-xr-x root/root         0 2024-10-22 16:19 proc-sys-net/bridge
-rw-r--r-- root/root         2 2024-10-22 16:19 proc-sys-net/bridge/bridge-nf-call-arptables
(omitted)
-rw-r--r-- root/root    173297 2024-10-22 16:19 socket-stats.txt
-rw-r--r-- root/root       358 2024-10-22 16:19 tcpdump.log

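There are a lot of files because the metadata is gathered from the output of the ip and iptables commands plus wholesale copies of the /proc/net and /proc/sys/net directories.

The packet capture itself is recorded in the pcap file.
In this run, the file is capture-test-for-qiita-kredit-20241022071847UTC.pcap.

A pcap file can be opened in Wireshark, so let's look inside.
As the screenshot shows, only DNS queries and their responses were captured, as intended.

DNS queries alone are a bit dull, so here is another capture I ran separately.
You can see packets from all sorts of services flying around inside the cluster.

For reference, here are also the contents of tcpdump.log, which records a summary of the capture run.
This capture picked up close to 300 packets.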

tcpdump.log

/usr/bin/tcpdump -w /tmp/capture-test-for-qiita-kredit-20241022071847UTC/capture-test-for-qiita-kredit-20241022071847UTC.pcap --relinquish-privileges=root udp port 53

dropped privs to root
tcpdump: listening on ens160, link-type EN10MB (Ethernet), snapshot length 262144 bytes
288 packets captured
299 packets received by filter
0 packets dropped by kernel
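When many captures are run, reading these counters by eye gets tedious. The following is a minimal Go sketch that extracts the three summary counters from the text of tcpdump.log. The type and function names are hypothetical, and it assumes the counter lines keep the exact wording shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tcpdumpSummary holds the three counters tcpdump prints when it exits.
type tcpdumpSummary struct {
	Captured, Received, Dropped int
}

// parseTcpdumpSummary scans tcpdump.log text for the counter lines and
// returns an error unless all three are present.
func parseTcpdumpSummary(log string) (tcpdumpSummary, error) {
	var s tcpdumpSummary
	counters := []struct {
		suffix string
		dst    *int
	}{
		{" packets captured", &s.Captured},
		{" packets received by filter", &s.Received},
		{" packets dropped by kernel", &s.Dropped},
	}
	found := 0
	for _, line := range strings.Split(log, "\n") {
		line = strings.TrimSpace(line)
		for _, c := range counters {
			if !strings.HasSuffix(line, c.suffix) {
				continue
			}
			n, err := strconv.Atoi(strings.TrimSuffix(line, c.suffix))
			if err != nil {
				return s, fmt.Errorf("parse %q: %w", line, err)
			}
			*c.dst = n
			found++
		}
	}
	if found != len(counters) {
		return s, fmt.Errorf("expected %d counter lines, found %d", len(counters), found)
	}
	return s, nil
}

func main() {
	// The tail of the tcpdump.log shown above.
	log := "dropped privs to root\n" +
		"288 packets captured\n" +
		"299 packets received by filter\n" +
		"0 packets dropped by kernel\n"
	s, err := parseTcpdumpSummary(log)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("captured=%d received=%d dropped=%d\n", s.Captured, s.Received, s.Dropped)
}
```

A non-zero dropped counter would be the first thing to check if packets seem to be missing from the pcap file.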

Here are the opening parts of the remaining files, which record command output.

ip-resources.txt

Summary:

/usr/bin/ip -d -j addr show(IP address configuration)
/usr/bin/ip -d -j neighbor show(IP neighbor status)
/usr/bin/ip rule list(Policy routing list)
/usr/bin/ip route show table all(Routes of all route tables)

Execute:

/usr/bin/ip -d -j addr show
(rest omitted)

iptables-rules.txt

Summary:

/usr/bin/iptables-legacy-save(IPtables rules)
/usr/bin/iptables-legacy -vnx -L(IPtables rules and stats in filter table)
/usr/bin/iptables-legacy -vnx -L -t nat(IPtables rules and stats in nat table)
/usr/bin/iptables-legacy -vnx -L -t mangle(IPtables rules and stats in mangle table)

Execute:

/usr/bin/iptables-legacy-save
(rest omitted)

socket-stats.txt

Summary:

/usr/bin/ss -s(Socket statistics summary)
/usr/bin/ss -tapionume(Socket statistics details)

Execute:

/usr/bin/ss -s
(rest omitted)

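Persistent volume (PV)

Next, let's try saving to a persistent volume (PV).
In this case, the capture results are stored on a PV inside the Kubernetes cluster.

To create a capture that uses a PV, specify the following option.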

pvc

Let's create a capture right away.

# kubectl-retina capture create --name capture-test-for-qiita-pvc --pvc retina-qiita-pvc --node-selectors "kubernetes.io/os-dist=ubuntu"
ts=2024-10-22T16:54:06.633+0900 level=info caller=capture/create.go:243 msg="The capture duration is set to 1m0s"
ts=2024-10-22T16:54:06.633+0900 level=info caller=capture/create.go:289 msg="The capture file max size is set to 100MB"
ts=2024-10-22T16:54:06.633+0900 level=info caller=utils/capture_image.go:56 msg="Using capture workload image ghcr.io/microsoft/retina/retina-agent:v0.0.12 with version determined by CLI version"
ts=2024-10-22T16:54:06.634+0900 level=info caller=capture/crd_to_job.go:288 msg="PersistentVolumeClaim is not empty" PersistentVolumeClaim=retina-qiita-pvc
ts=2024-10-22T16:54:06.655+0900 level=error caller=capture/create.go:121 msg="Failed to create job" error="failed to get pvc default/retina-qiita-pvc"
Error: failed to get pvc default/retina-qiita-pvc

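Oops, it complained that the PVC retina-qiita-pvc specified in the option could not be found.
Retina is apparently not generous enough to create the PVC automatically when it does not exist.
So I will create the PVC myself.
Applying the following YAML file with kubectl apply should create the PVC.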

pvc_retina.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: retina-qiita-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi

Let's try creating the capture again.

# kubectl-retina capture create --name capture-test-for-qiita-pvc --pvc retina-qiita-pvc --node-selectors "kubernetes.io/os-dist=ubuntu"
ts=2024-10-22T17:00:24.197+0900 level=info caller=capture/create.go:243 msg="The capture duration is set to 1m0s"
ts=2024-10-22T17:00:24.197+0900 level=info caller=capture/create.go:289 msg="The capture file max size is set to 100MB"
ts=2024-10-22T17:00:24.197+0900 level=info caller=utils/capture_image.go:56 msg="Using capture workload image ghcr.io/microsoft/retina/retina-agent:v0.0.12 with version determined by CLI version"
ts=2024-10-22T17:00:24.199+0900 level=info caller=capture/crd_to_job.go:288 msg="PersistentVolumeClaim is not empty" PersistentVolumeClaim=retina-qiita-pvc
ts=2024-10-22T17:00:24.321+0900 level=info caller=capture/crd_to_job.go:876 msg="The Parsed tcpdump filter is \"\""
ts=2024-10-22T17:00:24.353+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-pvc-vgf96
ts=2024-10-22T17:00:24.373+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-pvc-ccbqk
ts=2024-10-22T17:00:24.391+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-pvc-k7k82
ts=2024-10-22T17:00:24.392+0900 level=info caller=capture/create.go:125 msg="Please manually delete all capture jobs"
NAMESPACE   CAPTURE NAME                 JOBS                                                                                                 COMPLETIONS   AGE
default     capture-test-for-qiita-pvc   capture-test-for-qiita-pvc-ccbqk,capture-test-for-qiita-pvc-k7k82,capture-test-for-qiita-pvc-vgf96   0/3           0s

This time the capture was created successfully.

In this Kubernetes cluster the PV is actually backed by NFS running on a node, so let's look inside.

$ ls /nfs/pv0006/ -l
(omitted)
-rw-r--r--  1 root root 27336172 10月 22 17:03 capture-test-for-qiita-pvc-kredit-20241022080106UTC.tar.gz
-rw-r--r--  1 root root  5426752 10月 22 17:02 capture-test-for-qiita-pvc-nerv-20241022080009UTC.tar.gz
-rw-r--r--  1 root root 38297694 10月 22 17:03 capture-test-for-qiita-pvc-wunder-20241022080107UTC.tar.gz

Unsurprisingly, unlike hostPath, specifying a PV stores the capture results from every node in one place.
The contents of each result are the same as with hostPath, so I will skip them.

S3

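Finally, let's try object storage.
As you would expect from a Microsoft project, saving to an Azure storage account seems to be the easiest path, but uploading to S3 is supported too.

To create a capture that uses S3, specify the following options.
With S3 you also have to pass credentials, so there are more options to specify.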

s3-bucket
s3-region
s3-access-key-id
s3-secret-access-key

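S3-compatible object storage services are supported as well.
MinIO is one example.

This article does not go into access settings on the S3 side, such as bucket policies.

Let's specify all the options and create a capture.
The S3 credentials appear to be stored in a Secret that is created for you.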

# kubectl-retina capture create --name capture-test-for-qiita-s3 --s3-bucket retina-upload-sample-bucket --s3-region ap-northeast-1 --s3-access-key-id AKIAIOSFODNN7EXAMPLE --s3-secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY --node-selectors "kubernetes.io/os-dist=ubuntu"
ts=2024-10-22T20:40:10.456+0900 level=info caller=capture/create.go:243 msg="The capture duration is set to 1m0s"
ts=2024-10-22T20:40:10.456+0900 level=info caller=capture/create.go:289 msg="The capture file max size is set to 100MB"
ts=2024-10-22T20:40:10.501+0900 level=info caller=utils/capture_image.go:56 msg="Using capture workload image ghcr.io/microsoft/retina/retina-agent:v0.0.12 with version determined by CLI version"
ts=2024-10-22T20:40:10.508+0900 level=info caller=capture/crd_to_job.go:256 msg="S3Upload is not empty"
ts=2024-10-22T20:40:10.650+0900 level=info caller=capture/crd_to_job.go:876 msg="The Parsed tcpdump filter is \"\""
ts=2024-10-22T20:40:10.666+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-s3-dqg6m
ts=2024-10-22T20:40:10.680+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-s3-rdwsk
ts=2024-10-22T20:40:10.709+0900 level=info caller=capture/create.go:369 msg="Packet capture job is created" namespace=default capture job=capture-test-for-qiita-s3-8p9r4
ts=2024-10-22T20:40:10.709+0900 level=info caller=capture/create.go:125 msg="Please manually delete all capture jobs"
ts=2024-10-22T20:40:10.712+0900 level=info caller=capture/create.go:130 msg="Please manually delete capture secret" namespace=default secret name=capture-s3-upload-secret2sfp2
NAMESPACE   CAPTURE NAME                JOBS                                                                                              COMPLETIONS   AGE
default     capture-test-for-qiita-s3   capture-test-for-qiita-s3-8p9r4,capture-test-for-qiita-s3-dqg6m,capture-test-for-qiita-s3-rdwsk   0/3           0s

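The values in the run above have been replaced with different ones from those I actually specified.

Now, let's look at the Job log to see whether it worked. (I have a bad feeling about this.)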

ts=2024-10-22T11:41:08.420Z level=info caller=provider/network_capture_unix.go:353 msg="Done for collecting network metadata"
ts=2024-10-22T11:41:30.380Z level=info caller=outputlocation/s3.go:83 msg="Upload capture file to s3" location=S3Upload source file path=/tmp/capture-test-for-qiita-s3-nerv-20241022113955UTC.tar.gz bucketName=retina-upload-sample-bucket objectKey=retina/captures/tmp/capture-test-for-qiita-s3-nerv-20241022113955UTC.tar.gz
ts=2024-10-22T11:43:02.721Z level=error caller=outputlocation/s3.go:111 msg="Couldn't upload file" srcFilePath=/tmp/capture-test-for-qiita-s3-nerv-20241022113955UTC.tar.gz bucketName=retina-upload-sample-bucket objectKey=retina/captures/tmp/capture-test-for-qiita-s3-nerv-20241022113955UTC.tar.gz error="failed to upload file to S3: operation error S3: PutObject, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put \"https://retina-upload-sample-bucket.s3.ap-northeast-1.amazonaws.com/retina/captures/tmp/capture-test-for-qiita-s3-nerv-20241022113955UTC.tar.gz?x-id=PutObject\": dial tcp 52.219.150.178:443: i/o timeout"
ts=2024-10-22T11:43:02.726Z level=info caller=captureworkload/main.go:70 msg="Done for capturing network traffic"
ts=2024-10-22T11:43:02.726Z level=info caller=provider/network_capture_unix.go:359 msg="Cleanup network capture" capture name=capture-test-for-qiita-s3 temporary dir=/tmp/capture-test-for-qiita-s3-nerv-20241022113955UTC

failed to upload file to S3

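Sure enough, it failed.
I expected this because the upload needs access outside the company network, and indeed the upload did not go through.
Well, I never told it about the proxy, so that is only natural.

Incidentally, the Job status becomes Completed even when the upload fails.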
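Since the Job ends up Completed either way, the log is the only place where the failure shows. As a minimal sketch, the following Go code picks out lines logged at level=error from a capture job log. It assumes the key=value layout seen in the logs above, the function names are hypothetical, and the sample in main is an abbreviated copy of two lines from the log above.

```go
package main

import (
	"fmt"
	"strings"
)

// logField returns the value of key=value or key="quoted value" in a
// logfmt-style line, and whether the key was found.
func logField(line, key string) (string, bool) {
	marker := key + "="
	pos := -1
	if strings.HasPrefix(line, marker) {
		pos = 0
	} else if i := strings.Index(line, " "+marker); i >= 0 {
		pos = i + 1
	}
	if pos < 0 {
		return "", false
	}
	rest := line[pos+len(marker):]
	if !strings.HasPrefix(rest, `"`) {
		// Unquoted value: runs until the next space.
		value, _, _ := strings.Cut(rest, " ")
		return value, true
	}
	// Quoted value: runs until the next unescaped double quote.
	var b strings.Builder
	for i := 1; i < len(rest); i++ {
		switch {
		case rest[i] == '\\' && i+1 < len(rest):
			i++ // keep the escaped character, drop the backslash
			b.WriteByte(rest[i])
		case rest[i] == '"':
			return b.String(), true
		default:
			b.WriteByte(rest[i])
		}
	}
	return b.String(), true // unterminated quote: return what was read
}

// errorMessages returns "caller: msg" for every line logged at level=error.
func errorMessages(log string) []string {
	var out []string
	for _, line := range strings.Split(log, "\n") {
		if level, ok := logField(line, "level"); !ok || level != "error" {
			continue
		}
		caller, _ := logField(line, "caller")
		msg, _ := logField(line, "msg")
		out = append(out, caller+": "+msg)
	}
	return out
}

func main() {
	// Abbreviated from the job log above.
	log := `ts=2024-10-22T11:41:08.420Z level=info caller=provider/network_capture_unix.go:353 msg="Done for collecting network metadata"
ts=2024-10-22T11:43:02.721Z level=error caller=outputlocation/s3.go:111 msg="Couldn't upload file" bucketName=retina-upload-sample-bucket`
	for _, m := range errorMessages(log) {
		fmt.Println(m)
	}
}
```

This is deliberately simple: a key that also appears inside an earlier quoted value would confuse it, which is acceptable for the log lines shown here but not for arbitrary logfmt input.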

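An abrupt yet predictable battle with the corporate proxy

Backing down here would be annoying, so I now want to see the S3 upload through.
Proxy problems trouble me time and again when developing on the corporate network, and I want to get past this one too.
I hope you will follow the trial and error to the end.

Is there an option for specifying a proxy?

If this works, it is the easiest solution.
It would be nice if Retina had an option that accepts proxy settings and handles the rest.

Let's look at the flag options.
The full list is long.

Flag options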

Flags:
      --blob-upload string            Blob SAS URL with write permission to upload capture files
      --debug                         When debug is true, a customized retina-agent image, determined by the environment variable RETINA_AGENT_IMAGE, is set
      --duration duration             Duration of capturing packets (default 1m0s)
      --exclude-filter string         A comma-separated list of IP:Port pairs that are excluded from capturing network packets. Supported formats are IP:Port, IP, Port, *:Port, IP:*
  -h, --help                          help for create
      --host-path string              HostPath of the node to store the capture files
      --include-filter string         A comma-separated list of IP:Port pairs that are used to filter capture network packets. Supported formats are IP:Port, IP, Port, *:Port, IP:*
      --include-metadata              If true, collect static network metadata into capture file (default true)
      --job-num-limit int             The maximum number of jobs can be created for each capture. 0 means no limit
      --max-size int                  Limit the capture file to MB in size which works only for Linux (default 100)
      --namespace-selectors string    A comma-separated list of namespace labels in which to apply the pod-selectors. By default, the pod namespace is specified by the flag namespace
      --no-wait                       Do not wait for the long-running capture job to finish (default true)
      --node-names string             A comma-separated list of node names to select nodes on which the network capture will be performed
      --node-selectors string         A comma-separated list of node labels to select nodes on which the network capture will be performed
      --packet-size int               Limits the each packet to bytes in size which works only for Linux
      --pod-selectors string          A comma-separated list of pod labels to select pods on which the network capture will be performed
      --pvc string                    PersistentVolumeClaim under the specified or default namespace to store capture files
      --s3-access-key-id string       S3 access key id to upload capture files
      --s3-bucket string              Bucket in which to store capture files
      --s3-endpoint string            Endpoint for an S3 compatible storage service. Use this if you are using a custom or private S3 service that requires a specific endpoint
      --s3-path string                Prefix path within the S3 bucket where captures will be stored (default "retina/captures")
      --s3-region string              Region where the S3 compatible bucket is located
      --s3-secret-access-key string   S3 access secret key to upload capture files
      --tcpdump-filter string         Raw tcpdump flags which works only for Linux

Global Flags:
      --as string                      Username to impersonate for the operation. User could be a regular user or a service account in a namespace.
      --as-group stringArray           Group to impersonate for the operation, this flag can be repeated to specify multiple groups.
      --as-uid string                  UID to impersonate for the operation.
      --cache-dir string               Default cache directory (default "/root/.kube/cache")
      --certificate-authority string   Path to a cert file for the certificate authority
      --client-certificate string      Path to a client certificate file for TLS
      --client-key string              Path to a client key file for TLS
      --cluster string                 The name of the kubeconfig cluster to use
      --context string                 The name of the kubeconfig context to use
      --disable-compression            If true, opt-out of response compression for all requests to the server
      --insecure-skip-tls-verify       If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
      --kubeconfig string              Path to the kubeconfig file to use for CLI requests.
      --name string                    The name of the Retina Capture
  -n, --namespace string               If present, the namespace scope for this CLI request
      --request-timeout string         The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests. (default "0")
  -s, --server string                  The address and port of the Kubernetes API server
      --tls-server-name string         Server name to use for server certificate validation. If it is not provided, the hostname used to contact the server is used
      --token string                   Bearer token for authentication to the API server
      --user string                    The name of the kubeconfig user to use

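I skimmed through them, but unfortunately there is no option for passing proxy information.
I confirmed that the same is true of v0.0.16, the latest version at the time of writing.

How does the upload work?

I suddenly became curious about how Retina uploads files to an S3 bucket.
Let's look at the source code to see how it is implemented.

The part that performs the upload (PutObject) is as follows.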

func (su *S3Upload) Output(srcFilePath string) error {
	// (omitted)
	_, err = s3Client.PutObject(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String(su.bucket),
		Key:    aws.String(objectKey),
		Body:   s3File,
	})

The package github.com/aws/aws-sdk-go-v2/service/s3 was imported.
So how do you configure a proxy with the AWS SDK for Go?
The official documentation covers this.

If you cannot directly connect to the internet, you can use Go-supported environment variables (HTTP_PROXY / HTTPS_PROXY) or create a custom HTTP client to configure your proxy.
Source: https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/custom-http/#configuring-a-proxy

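As is typical for proxy configuration, it reads the HTTP_PROXY and HTTPS_PROXY environment variables.
Creating a custom HTTP client would also work, but modifying the internal code is not realistic, so this time I will look for a way to pass in environment variables.

Incidentally, my impression is that Go packages that make HTTP requests mostly use the net/http library.
This library obtains proxy information from the HTTP_PROXY and HTTPS_PROXY environment variables.
The S3 SDK follows this pattern too.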

ProxyFromEnvironment returns the URL of the proxy to use for a given request, as indicated by the environment variables HTTP_PROXY, HTTPS_PROXY and NO_PROXY (or the lowercase versions thereof).
Source: https://pkg.go.dev/net/http#ProxyFromEnvironment
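To see this behavior in isolation, here is a minimal Go sketch that asks net/http which proxy it would use for a given URL, using http.ProxyFromEnvironment as quoted above. The helper name proxyFor is hypothetical, and the target URL simply reuses the masked bucket name from the earlier log. It only illustrates the standard library; it does not show how the AWS SDK wires up its HTTP client internally.

```go
package main

import (
	"fmt"
	"net/http"
)

// proxyFor reports which proxy, if any, net/http would use for target,
// based on HTTP_PROXY, HTTPS_PROXY and NO_PROXY. An empty result means
// the connection would be made directly.
func proxyFor(target string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	proxyURL, err := http.ProxyFromEnvironment(req)
	if err != nil {
		return "", err
	}
	if proxyURL == nil {
		return "", nil
	}
	return proxyURL.String(), nil
}

func main() {
	target := "https://retina-upload-sample-bucket.s3.ap-northeast-1.amazonaws.com/"
	proxy, err := proxyFor(target)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if proxy == "" {
		fmt.Println("no proxy configured: direct connection")
		return
	}
	fmt.Println("requests go through", proxy)
}
```

Running it with, for example, HTTPS_PROXY=http://proxy.example.internal:8080 set (a made-up proxy address) should report that proxy, while running it without the variable reports a direct connection. Note that net/http reads these variables once per process, so they must be in the environment before the program starts making requests.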

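Can't I just pass environment variables?

A path forward has emerged: I just need to pass environment variables to the Pod that runs the capture job.
It would be nice to pass them in an extraEnv-like form, as many Helm charts allow.
Let's look at the flag options above once more.

Well... as expected, nothing that convenient exists yet.
This is not going to be straightforward.

I want to pass environment variables to the Pod no matter what!

At this point it is a matter of pride.
I will make it upload through the proxy whatever it takes.

However, with the Retina CLI as it stands, there seems to be no way at all to pass proxy information to the Job.
So I will change course slightly and stop caring whether the Job is created through the CLI.
In other words, manually deploying a Job Pod that does the same processing, without the Retina CLI, is also acceptable.

A clue to the solution, and then

Once a Kubernetes Job has finished running, it cannot be restarted.
To run the same processing again, you need to create a new Job.
For a moment I thought about tweaking the Pod's environment variables with kubectl edit, but modifying a Pod after it has finished has no effect.

So I will reuse the existing Job definition, add the environment variables, and deploy it as a new Job.
An existing Job can be written out as a YAML manifest by running kubectl get with -o yaml, as follows.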

kubectl get jobs capture-test-for-qiita-s3-8p9r4 -o yaml > retina-s3-job-qiita.yaml

Here is the output.
I removed everything under status because it is not needed.
Some of the information has been masked.

YAML file (before modification)

retina-s3-job-qiita.yaml

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2024-10-22T11:40:10Z"
  generateName: capture-test-for-qiita-s3-
  generation: 1
  labels:
    capture-name: capture-test-for-qiita-s3
    retina.sh/app: capture
  name: capture-test-for-qiita-s3-8p9r4
  namespace: default
  resourceVersion: "89517113"
  uid: b098d917-cefe-4149-91c1-afffffc4a350
spec:
  backoffLimit: 0
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: b098d917-cefe-4149-91c1-afffffc4a350
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: b098d917-cefe-4149-91c1-afffffc4a350
        batch.kubernetes.io/job-name: capture-test-for-qiita-s3-8p9r4
        capture-name: capture-test-for-qiita-s3
        controller-uid: b098d917-cefe-4149-91c1-afffffc4a350
        job-name: capture-test-for-qiita-s3-8p9r4
        retina.sh/app: capture
      namespace: default
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - wunder
      containers:
      - command:
        - ./retina/captureworkload
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: APISERVER
          value: https://XX.XX.XX.XX:6443
        - name: INCLUDE_METADATA
          value: "true"
        - name: S3_ENDPOINT
        - name: S3_REGION
          value: ap-northeast-1
        - name: CAPTURE_DURATION
          value: 1m0s
        - name: S3_BUCKET
          value: retina-upload-sample-bucket
        - name: S3_PATH
          value: retina/captures
        - name: TCPDUMP_RAW_FILTER
        - name: CAPTURE_NAME
          value: capture-test-for-qiita-s3
        - name: CAPTURE_MAX_SIZE
          value: "100"
        - name: NODE_HOST_NAME
          value: wunder
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.12
        imagePullPolicy: IfNotPresent
        name: capture
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 10m
            memory: 64Mi
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_ADMIN
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/s3-upload-secret
          name: capture-s3-upload-secret2sfp2
          readOnly: true
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 1800
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - name: capture-s3-upload-secret2sfp2
        secret:
          defaultMode: 420
          secretName: capture-s3-upload-secret2sfp2

Reusing it as-is would collide with the existing resources, so I modify it as follows.

Change the following values to capture-test-for-qiita-s3-manual to avoid a Job name collision
- metadata.name
- spec.template.metadata.labels[batch.kubernetes.io/job-name]
- spec.template.metadata.labels[job-name]
Add the proxy environment variables (HTTP_PROXY and HTTPS_PROXY) to spec.template.spec.containers[].env[]
Delete the following items
- spec.template.metadata.labels[controller-uid]
- spec.template.metadata.labels[batch.kubernetes.io/controller-uid]
- spec.selector
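
The edits in the list above can also be applied programmatically. The following is only a minimal sketch, assuming the Job manifest has already been loaded into a plain Python dict; the helper name `make_manual_job` and the trimmed-down `original` manifest are hypothetical and exist only for illustration.

```python
import copy


def make_manual_job(job, new_name, extra_env):
    """Return a copy of a Job manifest (a dict) with the edits listed above applied."""
    manual = copy.deepcopy(job)

    # 1. Rename the Job so it does not collide with the existing one.
    manual["metadata"]["name"] = new_name
    labels = manual["spec"]["template"]["metadata"]["labels"]
    labels["batch.kubernetes.io/job-name"] = new_name
    labels["job-name"] = new_name

    # 2. Append the extra environment variables to every container.
    for container in manual["spec"]["template"]["spec"]["containers"]:
        container.setdefault("env", []).extend(
            {"name": key, "value": value} for key, value in extra_env.items()
        )

    # 3. Remove the fields that tie the Pod template to the original Job.
    labels.pop("controller-uid", None)
    labels.pop("batch.kubernetes.io/controller-uid", None)
    manual["spec"].pop("selector", None)
    return manual


# Trimmed-down stand-in for the Job exported from the cluster (illustrative only).
uid = "b098d917-cefe-4149-91c1-afffffc4a350"
original = {
    "metadata": {"name": "capture-test-for-qiita-s3-8p9r4", "namespace": "default"},
    "spec": {
        "selector": {"matchLabels": {"batch.kubernetes.io/controller-uid": uid}},
        "template": {
            "metadata": {
                "labels": {
                    "batch.kubernetes.io/controller-uid": uid,
                    "batch.kubernetes.io/job-name": "capture-test-for-qiita-s3-8p9r4",
                    "capture-name": "capture-test-for-qiita-s3",
                    "controller-uid": uid,
                    "job-name": "capture-test-for-qiita-s3-8p9r4",
                    "retina.sh/app": "capture",
                }
            },
            "spec": {"containers": [{"name": "capture", "env": []}]},
        },
    },
}

manual = make_manual_job(
    original,
    "capture-test-for-qiita-s3-manual",
    {
        "HTTP_PROXY": "http://proxy.example.com:1234",
        "HTTPS_PROXY": "http://proxy.example.com:1234",
    },
)
```

To run this against a real manifest, the output of `kubectl get job <name> -o json` could be read with `json.load` and the result written back out with `json.dump`, since `kubectl apply -f` accepts JSON as well as YAML.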

The modified YAML file looks like this.

YAML file (modified)

retina-s3-job-qiita.yaml

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2024-10-22T11:40:10Z"
  generateName: capture-test-for-qiita-s3-
  generation: 1
  labels:
    capture-name: capture-test-for-qiita-s3
    retina.sh/app: capture
  name: capture-test-for-qiita-s3-manual
  namespace: default
  resourceVersion: "89517113"
  uid: b098d917-cefe-4149-91c1-afffffc4a350
spec:
  backoffLimit: 0
  completionMode: NonIndexed
  completions: 1
  manualSelector: false
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/job-name: capture-test-for-qiita-s3-manual
        capture-name: capture-test-for-qiita-s3
        job-name: capture-test-for-qiita-s3-manual
        retina.sh/app: capture
      namespace: default
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - wunder
      containers:
      - command:
        - ./retina/captureworkload
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: APISERVER
          value: https://XX.XX.XX.XX:6443
        - name: INCLUDE_METADATA
          value: "true"
        - name: S3_ENDPOINT
        - name: S3_REGION
          value: ap-northeast-1
        - name: CAPTURE_DURATION
          value: 1m0s
        - name: S3_BUCKET
          value: retina-upload-sample-bucket
        - name: S3_PATH
          value: retina/captures
        - name: TCPDUMP_RAW_FILTER
        - name: CAPTURE_NAME
          value: capture-test-for-qiita-s3
        - name: CAPTURE_MAX_SIZE
          value: "100"
        - name: NODE_HOST_NAME
          value: wunder
        - name: HTTP_PROXY
          value: http://proxy.example.com:1234
        - name: HTTPS_PROXY
          value: http://proxy.example.com:1234
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.12
        imagePullPolicy: IfNotPresent
        name: capture
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 10m
            memory: 64Mi
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_ADMIN
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/s3-upload-secret
          name: capture-s3-upload-secret2sfp2
          readOnly: true
      dnsPolicy: ClusterFirst
      hostIPC: true
      hostNetwork: true
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 1800
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - name: capture-s3-upload-secret2sfp2
        secret:
          defaultMode: 420
          secretName: capture-s3-upload-secret2sfp2

Now, let's deploy the YAML file. This time for sure...

# kubectl apply -f retina-s3-job-qiita.yaml
job.batch/capture-test-for-qiita-s3-manual created

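I listed the captures with the Retina CLI.
The manually deployed Job is also included as a capture job.
The CLI appears to decide this from the Pod metadata.
Because the manually deployed Job reuses the Pod template that the CLI created, it carries the same metadata and was therefore picked up.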
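
The grouping behaviour guessed at above can be illustrated with a small sketch. This is not Retina's actual implementation. It only assumes that Jobs are grouped by the `retina.sh/app` and `capture-name` labels found in the Pod template, which is a guess based on the manifests shown in this article; the function name `group_jobs_by_capture` and the `unrelated-job` entry are hypothetical.

```python
from collections import defaultdict


def group_jobs_by_capture(jobs):
    """Group Job names by (namespace, capture-name), using Pod template labels."""
    groups = defaultdict(list)
    for job in jobs:
        meta = job["metadata"]
        labels = job["spec"]["template"]["metadata"].get("labels", {})
        # Skip anything that does not look like a capture Job (assumed rule).
        if labels.get("retina.sh/app") != "capture":
            continue
        capture = labels.get("capture-name")
        if capture is None:
            continue
        groups[(meta.get("namespace", "default"), capture)].append(meta["name"])
    return {key: sorted(names) for key, names in groups.items()}


def _job(name, labels):
    """Build a minimal Job-shaped dict for the example below."""
    return {
        "metadata": {"name": name, "namespace": "default"},
        "spec": {"template": {"metadata": {"labels": labels}}},
    }


capture_labels = {"retina.sh/app": "capture", "capture-name": "capture-test-for-qiita-s3"}
jobs = [
    _job("capture-test-for-qiita-s3-rdwsk", capture_labels),
    _job("capture-test-for-qiita-s3-manual", capture_labels),
    _job("capture-test-for-qiita-s3-8p9r4", capture_labels),
    _job("capture-test-for-qiita-s3-dqg6m", capture_labels),
    _job("unrelated-job", {"app": "something-else"}),
]

grouped = group_jobs_by_capture(jobs)
```

Under this assumption, a hand-written Job ends up in the same group as the CLI-created ones simply because it carries the same labels.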

# kubectl-retina capture list --name ""
NAMESPACE   CAPTURE NAME                 JOBS                                                                                                                               COMPLETIONS   AGE
default     capture-test-for-qiita-s3    capture-test-for-qiita-s3-8p9r4,capture-test-for-qiita-s3-dqg6m,capture-test-for-qiita-s3-manual,capture-test-for-qiita-s3-rdwsk   3/4           17h

Let's look at the Job's log.

ts=2024-10-23T11:04:08.072Z level=info caller=outputlocation/s3.go:83 msg="Upload capture file to s3" location=S3Upload source file path=/tmp/capture-test-for-qiita-s3-wunder-20241023110206UTC.tar.gz bucketName=retina-upload-sample-bucket objectKey=retina/captures/tmp/capture-test-for-qiita-s3-wunder-20241023110206UTC.tar.gz
ts=2024-10-23T11:04:13.173Z level=info caller=captureworkload/main.go:70 msg="Done for capturing network traffic"
ts=2024-10-23T11:04:13.173Z level=info caller=provider/network_capture_unix.go:359 msg="Cleanup network capture" capture name=capture-test-for-qiita-s3 temporary dir=/tmp/capture-test-for-qiita-s3-wunder-20241023110206UTC

This time there is no complaint such as failed to upload file to S3.
Did it work?
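
Reading through the log by eye works for a single job, but the same check can be sketched in a few lines. The snippet below is only an illustration: it assumes the log format shown above, treats any `level=error` line or any line containing `failed to upload file to S3` as a problem, and the function name `summarize_capture_log` is hypothetical.

```python
import re

LEVEL_RE = re.compile(r"\blevel=(\w+)")
DONE_MARKER = 'msg="Done for capturing network traffic"'
FAILURE_MARKER = "failed to upload file to S3"


def summarize_capture_log(log_text):
    """Return (finished, problems) for the text of a capture job log."""
    finished = False
    problems = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if DONE_MARKER in line:
            finished = True
        level = LEVEL_RE.search(line)
        # Collect lines that look like errors (assumed criteria, see lead-in).
        if (level and level.group(1) == "error") or FAILURE_MARKER in line:
            problems.append(line)
    return finished, problems


# Two of the log lines shown above, copied verbatim.
sample_log = """
ts=2024-10-23T11:04:08.072Z level=info caller=outputlocation/s3.go:83 msg="Upload capture file to s3" location=S3Upload source file path=/tmp/capture-test-for-qiita-s3-wunder-20241023110206UTC.tar.gz bucketName=retina-upload-sample-bucket objectKey=retina/captures/tmp/capture-test-for-qiita-s3-wunder-20241023110206UTC.tar.gz
ts=2024-10-23T11:04:13.173Z level=info caller=captureworkload/main.go:70 msg="Done for capturing network traffic"
"""

finished, problems = summarize_capture_log(sample_log)
```

A finished capture with an empty problem list is still only a hint, so checking the bucket itself remains the final confirmation.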

Let's check the S3 side as well.

There it is! Success at last!
It was a somewhat long road, but I managed to ~~forcibly~~ get the upload to S3 working even behind a proxy.

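Epilogue

This time I went through a lot of trial and error after finding out that arbitrary environment variables cannot be added, but a look at the source code suggests that more are planned.
I'm looking forward to it.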

// TODO: more env to be added

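Also, although I did not choose it this time, using a transparent proxy looks like another way to solve this.
With that approach there is no need to pass proxy information to the Pod in the first place.
I wonder which is easier.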

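Summary

In this article, I examined Retina's capture feature by actually trying each output destination.
Since it is still an immature tool, I noticed some commands that seem unfinished and inconsistencies in how options are specified.
On the other hand, the capture feature itself is already rich, and my impression is that it gathers all the information you need.

As for output destinations, captures can be stored both locally and in the cloud, so I would give it a passing grade.
However, it is still awkward to use in environments such as those behind a proxy.

I hope to contribute these missing features to the community when I have the time.

Thank you for reading this far.
This is the last article on Retina's features, but I still have a few small tidbits left, so I may publish a bonus article.
Please give Retina a try yourself.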
