I built a website monitoring service (a mini Pingdom) as a learning project

Last updated at 2025-05-04 / Posted at 2025-05-03

Background

A follow-up to my previous article. This time I built a website monitoring service (a mini Pingdom / CloudWatch Synthetics).

Overview

Service overview

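Go Exporter
- Checks each target with HTTP GET and ICMP ping
- Exposes a success counter, a total-checks counter, and a latency histogram in Prometheus format at /metrics
Prometheus
- Scrapes the exporter's /metrics every 15 seconds and stores the samples in its TSDB (each sample is a snapshot of the current counter value, not the difference from the previous scrape)
Alertmanager
- Receives the "no successful check for 45 seconds" alert (the alerting rule itself is evaluated by Prometheus)
- Sends the notification to Slack
Grafana
- Uses Prometheus as its data source to visualize success rate, latency over time, and failure count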

Results

Metrics exposed in Prometheus format (http://localhost:9183/metrics)
Sorry for the traffic, Google. (8.8.8.8 does not answer HTTP, but I sampled it anyway.)

...
# HELP uptime_check_duration_seconds Duration of checks in seconds
# TYPE uptime_check_duration_seconds histogram
uptime_check_duration_seconds_bucket{protocol="http",target="8.8.8.8",le="0.005"} 8183
uptime_check_duration_seconds_bucket{protocol="http",target="8.8.8.8",le="0.01"} 8187
...
uptime_check_duration_seconds_bucket{protocol="http",target="8.8.8.8",le="5"} 8194
uptime_check_duration_seconds_bucket{protocol="http",target="8.8.8.8",le="10"} 8194
uptime_check_duration_seconds_bucket{protocol="http",target="8.8.8.8",le="+Inf"} 8194
uptime_check_duration_seconds_sum{protocol="http",target="8.8.8.8"} 2.060316312000004
uptime_check_duration_seconds_count{protocol="http",target="8.8.8.8"} 8194
uptime_check_duration_seconds_bucket{protocol="http",target="https://www.google.com",le="0.005"} 0
uptime_check_duration_seconds_bucket{protocol="http",target="https://www.google.com",le="0.01"} 0
...
uptime_check_duration_seconds_bucket{protocol="http",target="https://www.google.com",le="10"} 8194
uptime_check_duration_seconds_bucket{protocol="http",target="https://www.google.com",le="+Inf"} 8194
uptime_check_duration_seconds_sum{protocol="http",target="https://www.google.com"} 1267.3501819319974
uptime_check_duration_seconds_count{protocol="http",target="https://www.google.com"} 8194
uptime_check_duration_seconds_bucket{protocol="icmp",target="8.8.8.8",le="0.005"} 0
uptime_check_duration_seconds_bucket{protocol="icmp",target="8.8.8.8",le="0.01"} 0
...
uptime_check_duration_seconds_bucket{protocol="icmp",target="8.8.8.8",le="2.5"} 8186
uptime_check_duration_seconds_bucket{protocol="icmp",target="8.8.8.8",le="5"} 8186
uptime_check_duration_seconds_bucket{protocol="icmp",target="8.8.8.8",le="10"} 8186
uptime_check_duration_seconds_bucket{protocol="icmp",target="8.8.8.8",le="+Inf"} 8194
uptime_check_duration_seconds_sum{protocol="icmp",target="8.8.8.8"} 2758.2098659350017
uptime_check_duration_seconds_count{protocol="icmp",target="8.8.8.8"} 8194
uptime_check_duration_seconds_bucket{protocol="icmp",target="https://www.google.com",le="0.005"} 1
uptime_check_duration_seconds_bucket{protocol="icmp",target="https://www.google.com",le="0.01"} 1
uptime_check_duration_seconds_bucket{protocol="icmp",target="https://www.google.com",le="0.025"} 1385
...
uptime_check_duration_seconds_sum{protocol="icmp",target="https://www.google.com"} 1791.4891992930002
uptime_check_duration_seconds_count{protocol="icmp",target="https://www.google.com"} 8194
# HELP uptime_checks_success_total Total number of successful checks
# TYPE uptime_checks_success_total counter
uptime_checks_success_total{protocol="http",target="https://www.google.com"} 7999
uptime_checks_success_total{protocol="icmp",target="8.8.8.8"} 8194
uptime_checks_success_total{protocol="icmp",target="https://www.google.com"} 8194
# HELP uptime_checks_total Total number of checks performed
# TYPE uptime_checks_total counter
uptime_checks_total{protocol="http",target="8.8.8.8"} 8194
uptime_checks_total{protocol="http",target="https://www.google.com"} 8194
uptime_checks_total{protocol="icmp",target="8.8.8.8"} 8194
uptime_checks_total{protocol="icmp",target="https://www.google.com"} 8194
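
The histogram buckets in this output are cumulative: each `le` line counts every observation at or below that bound, so the `le="+Inf"` line always equals `_count`. A small Go sketch can turn cumulative counts into per-bucket counts (the helper name `perBucket` is mine, for illustration only). Applied to the last four buckets of the icmp / 8.8.8.8 series above, it shows how many checks landed above the 10-second bound.

```go
package main

import "fmt"

// perBucket converts cumulative bucket counts (as exposed under "le" labels)
// into the number of observations that fell into each individual bucket.
// The first entry stays cumulative for everything at or below its bound.
func perBucket(cumulative []uint64) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// Cumulative counts copied from the icmp / 8.8.8.8 series above,
	// for le = 2.5, 5, 10, +Inf.
	cum := []uint64{8186, 8186, 8186, 8194}
	fmt.Println(perBucket(cum)) // [8186 0 0 8]
}
```

Read this way, 8 of the 8194 ICMP checks against 8.8.8.8 took longer than 10 seconds, and none fell between 2.5 and 10 seconds.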

Grafana dashboard (http://localhost:3000/)
Average latency
- rate(uptime_check_duration_seconds_sum{protocol="http", target="https://www.google.com"}[5m]) / rate(uptime_check_duration_seconds_count{protocol="http", target="https://www.google.com"}[5m])
- "Sum of latencies over the past 5 minutes" / "number of requests over the past 5 minutes" = "average latency over the past 5 minutes"
HTTP success rate
- rate(uptime_checks_success_total{protocol="http", target="https://www.google.com"}[5m]) / rate(uptime_checks_total{protocol="http", target="https://www.google.com"}[5m])
- "HTTP successes over the past 5 minutes" / "HTTP requests over the past 5 minutes"
HTTP failure count
- increase(uptime_checks_total{protocol="http", target="https://www.google.com"}[5m]) - increase(uptime_checks_success_total{protocol="http", target="https://www.google.com"}[5m])
- "HTTP requests over the past 5 minutes" - "HTTP successes over the past 5 minutes" = "HTTP failures over the past 5 minutes"
Latency histogram p90
- histogram_quantile(0.90, rate(uptime_check_duration_seconds_bucket{protocol="http"}[5m]))

Architecture

Local version (Docker Compose)

[ Browser ]
     │        (http)
     ▼
┌──────────────┐      ┌──────────────┐
│   Grafana    │◀────▶│  Prometheus  │
│ (3000)       │       │ (9090)      │
└───┬──────────┘       └───┬──────────┘
    │   PromQL              │ scrape /metrics
    ▼                       │
┌──────────────┐            │
│ Alertmanager │            │
│   (9093)     │            │
└───┬──────────┘            │
    │                        │
    ▼                        │
┌──────────────┐             │
│ Go Exporter  │─────────────┘
│   (9183)     │──HTTP/ICMP──▶ monitored targets
└──────────────┘

EKS version

┌─────────┐
│ Browser │
└────┬────┘
     │ HTTPS (ACM)
     ▼
┌──────────────────────────────────────────┐
│ AWS Load Balancer Controller (ALB)       │
│ - Watches Ingress resources              │
│ - Attaches ACM cert to HTTPS listener    │
└────┬──────────┬───────────┬───────────┘
     │          │           │
     ▼          │           ▼
┌────────┐      │      ┌─────────────┐
│ Grafana│◀─────┘      │ Alertmanager│
│ Pod×1  │             │ Pod×1       │
└──┬─────┘             └───────┬─────┘
   │ PromQL                      │ notifications
   ▼                             ▼
┌──────────┐        scrape      ┌─────────────────┐
│Prometheus│◀──────────────────▶│ Go Exporter Pod  │
│ Pod×1    │    (/metrics)     └─────────────────┘
└──────────┘

Implementation steps and time spent

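Phase	Task	Time spent
Preparation and environment setup	Planning, initializing the Git repository, creating the directory layout, preparing the Docker Compose development environment	1h
Implementing the monitoring server	Go implementation (HTTP/ICMP checks, exposing Prometheus metrics)	2h
Dockerizing the monitoring server	Writing the Dockerfile, then wiring it into Docker Compose together with Prometheus and Alertmanager	1h
Prometheus and Alertmanager configuration	Scrape settings, alert rule settings, and preparing the Slack side to receive notifications	1.5h
Adding Grafana	Adding Grafana to Compose, configuring Prometheus as a data source, building the initial dashboard	1.5h
Local testing	Checking the dashboard in a browser, confirming alert notifications	1h
Kubernetes (EKS) - 1	Installing command-line tools, creating the ACM certificate, getting the big picture, etc.	1h
Kubernetes (EKS) - 2	Creating the EKS cluster, installing the AWS Load Balancer Controller, creating PVCs, etc. (abandoned partway)	4h+
		About 13 h in total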

Lessons and impressions

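Prometheus, Alertmanager, and Grafana were smoother to get running than I expected. Being used to CloudWatch Metrics, I was somewhat thrown by Prometheus's time series database (TSDB), its label-based data model, and the quirks of PromQL, but overall it was easy to understand. Thinking of Alertmanager as the counterpart of CloudWatch Alarms and Grafana as the counterpart of CloudWatch Dashboards made them feel familiar, and their high customizability makes them fun to work with.
- While writing PromQL queries I learned the difference between applying rate and increase to a histogram-type metric: rate gives the average per-second growth over the window, while increase gives the total growth over the window.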
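
The difference between the two functions can be sketched in Go. This is a deliberately simplified model with hypothetical names and made-up samples: it takes the first and last sample in the window and ignores the extrapolation and counter-reset handling that Prometheus performs.

```go
package main

import "fmt"

// sample is one scraped counter value with its timestamp in seconds.
type sample struct {
	t float64
	v float64
}

// increase returns how much the counter grew between the first and last
// sample in the window. Simplification: no extrapolation, no reset handling.
func increase(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	return window[len(window)-1].v - window[0].v
}

// rate returns the average per-second growth over the same samples.
func rate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	elapsed := window[len(window)-1].t - window[0].t
	return increase(window) / elapsed
}

func main() {
	// Hypothetical counter scraped every 15 s while growing by 1 per second.
	w := []sample{{0, 100}, {15, 115}, {30, 130}, {45, 145}, {60, 160}}
	fmt.Println(increase(w), rate(w)) // 60 1
}
```

When two rates over the same window are divided, the window length cancels out, which is why the average-latency and success-rate panels can use rate. The failure-count panel wants an absolute number of checks, so it uses increase.
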
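Kubernetes (EKS) was the hard part. Just getting the Grafana and Prometheus dashboards to display cost me time because of an authentication redirect that went 302 → 404. Since starting the EKS deployment I have spent more than 4 hours troubleshooting, including EBS-related errors and ALB Ingress routing. For now I have stopped there and written up the process so far as this blog post. The EKS bill was also a concern, so I cleaned up the cluster for the time being.
This was my first attempt at Kubernetes. Deploying four components (Go Exporter, Prometheus, Alertmanager, Grafana) to a cluster and combining them with persistent storage (EBS) and ALB Ingress was, honestly, a bit too heavy to take on all at once. Starting over with Kubernetes and EKS using a more minimal setup, or a well-prepared tutorial environment, seems like the better approach.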

Next steps

I want to take another shot at Kubernetes / EKS at some point. I would also like to try ECS.
I want to do a project using Next.js, Supabase, and the like.
