科学と神々株式会社 Advent Calendar 2025

Hybrid License System Day 25: Operations and Monitoring

Posted at 2025-12-24 / Last updated at 2025-12-24

Integration & Deployment Series (5/5) - The Final Day!

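🎉 Introduction

Day 25 is finally here, the last day! Today we look at operations and monitoring and learn how to keep the Hybrid License System running reliably in production.

To wrap up the 25 days, we implement Prometheus integration, a Grafana dashboard, log aggregation, and alerting.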

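📊 Why Monitoring Matters

Why do we need monitoring?

Incident → Detect → Diagnose → Fix → Prevent
   ↑                                    ↓
   └────────────────────────────────────┘
      Continuous improvement through monitoring

Goals of monitoring:

Early failure detection: catch problems before users notice them
Performance optimization: identify and remove bottlenecks
Capacity planning: decide when to scale
SLA compliance: verify that service-level objectives are being met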

🔍 Four Golden Signals

The four signals Google recommends prioritizing in monitoring:

1. Latency

The time from receiving a request to sending the response

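// Express.js middleware that measures latency
// (httpRequestDuration is the Histogram defined in the Prometheus integration section)
app.use((req, res, next) => {
  const start = process.hrtime.bigint();  // monotonic, high-resolution clock

  res.on('finish', () => {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;

    // Record into the Prometheus histogram (Prometheus pulls it on each scrape)
    httpRequestDuration.observe(
      {
        method: req.method,
        route: req.route?.path || req.path,  // fall back to the path when no route matched
        status_code: res.statusCode
      },
      durationMs / 1000  // in seconds
    );

    console.log({
      method: req.method,
      path: req.path,
      statusCode: res.statusCode,
      duration: `${durationMs.toFixed(1)}ms`
    });
  });

  next();
});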

Targets:

p50 (median): <25ms
p95: <50ms
p99: <100ms
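The p50/p95/p99 targets above are percentiles of the latency distribution. In production Prometheus estimates them from histogram buckets, but for raw samples (for example from a load test) a minimal sketch like the one below computes nearest-rank percentiles and checks them against the same targets. The helpers percentile and checkLatencyTargets are hypothetical, not part of the system's code.

```javascript
// Nearest-rank percentile of raw latency samples (milliseconds)
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Smallest value such that at least p% of samples are <= it.
  // Multiply before dividing so integer inputs avoid floating-point error.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Latency targets from this article, in milliseconds
const LATENCY_TARGETS_MS = { 50: 25, 95: 50, 99: 100 };

function checkLatencyTargets(samples) {
  const results = {};
  for (const [p, limit] of Object.entries(LATENCY_TARGETS_MS)) {
    const value = percentile(samples, Number(p));
    results[`p${p}`] = { value, limit, ok: value < limit };
  }
  return results;
}

// Usage: latencies collected during a load test
console.log(checkLatencyTargets([10, 12, 15, 20, 22, 30, 40, 45, 60, 90]));
```

With these ten samples the median (22ms) meets its target, but a single slow request out of ten is enough to push p95 to 90ms and miss the 50ms target.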

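2. Traffic

The volume of requests reaching the system

// Request counter
const { Counter } = require('prom-client');

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      // Falls back to the raw path when no route matched; raw paths such as
      // /users/123 create one series per ID, so watch label cardinality
      route: req.route?.path || req.path,
      status_code: res.statusCode
    });
  });

  next();
});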

What to watch:

Requests per second (RPS)
Traffic distribution by endpoint
Identifying peak hours
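RPS is not stored directly; it is derived from the ever-increasing http_requests_total counter. The simplified sketch below shows the core idea behind PromQL's rate(): the increase between two samples divided by the elapsed time, treating a drop in value as a counter reset (for example after a restart). The real rate() uses every sample in the range window and extrapolates to the window edges; counterRate is a hypothetical name.

```javascript
// Per-second rate between two samples of a monotonically increasing counter.
// Each sample is { timestampMs, value }.
function counterRate(prev, curr) {
  const elapsedSec = (curr.timestampMs - prev.timestampMs) / 1000;
  if (elapsedSec <= 0) throw new Error('samples must be in increasing time order');
  // A lower value means the counter was reset (e.g. the process restarted) and
  // counted up again from 0, so the increase is the current value itself.
  const increase = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return increase / elapsedSec;
}

// Two scrapes 15 seconds apart (the scrape_interval used later in this article)
console.log(counterRate({ timestampMs: 0, value: 100 }, { timestampMs: 15000, value: 250 }));
```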

3. Errors

The share of requests that fail

// Error counter (every response with status 400 or above)
const httpErrorsTotal = new Counter({
  name: 'http_errors_total',
  help: 'Total number of HTTP errors',
  labelNames: ['method', 'route', 'status_code']
});

app.use((req, res, next) => {
  res.on('finish', () => {
    if (res.statusCode >= 400) {
      httpErrorsTotal.inc({
        method: req.method,
        route: req.route?.path || req.path,
        status_code: res.statusCode
      });
    }
  });

  next();
});

Targets:

Error rate: <0.1% (at most 1 in 1,000 requests)
4xx errors: client-side causes (improve validation)
5xx errors: server-side causes (fix immediately)
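As a worked example of the 0.1% target: with 1,000 requests of which one returned 404 and one returned 503, the error rate is 2 / 1000 = 0.2%, which misses the target even though only one request failed on the server side. The hypothetical helper below performs this classification and calculation, counting every status code of 400 or above as an error, the same rule http_errors_total uses.

```javascript
// Classify status codes and compare the error rate with a target (default 0.1%)
function summarizeErrors(statusCodes, targetRate = 0.001) {
  let clientErrors = 0; // 4xx: caused by the client (improve validation)
  let serverErrors = 0; // 5xx: caused by the server (act immediately)
  for (const code of statusCodes) {
    if (code >= 500) serverErrors++;
    else if (code >= 400) clientErrors++;
  }
  const total = statusCodes.length;
  const errorRate = total === 0 ? 0 : (clientErrors + serverErrors) / total;
  return { total, clientErrors, serverErrors, errorRate, withinTarget: errorRate < targetRate };
}

// 998 successful requests, one 404 and one 503
console.log(summarizeErrors(Array(998).fill(200).concat([404, 503])));
```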

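4. Saturation

How heavily system resources are being used

// CPU and memory usage
const v8 = require('v8');
const { Gauge } = require('prom-client');

const processMemoryUsage = new Gauge({
  name: 'process_memory_usage_bytes',
  help: 'V8 heap memory in use, in bytes'
});

// Upper bound used by the HighMemoryUsage alert (the V8 heap size limit)
const processMemoryLimit = new Gauge({
  name: 'process_memory_limit_bytes',
  help: 'V8 heap size limit in bytes'
});

const processCpuUsage = new Gauge({
  name: 'process_cpu_usage_percent',
  help: 'CPU usage percentage over the last sampling interval'
});

let lastCpu = process.cpuUsage();
let lastTime = Date.now();

setInterval(() => {
  processMemoryUsage.set(process.memoryUsage().heapUsed);
  processMemoryLimit.set(v8.getHeapStatistics().heap_size_limit);

  // process.cpuUsage() is cumulative since process start (microseconds),
  // so take the difference from the previous sample
  const cpuNow = process.cpuUsage();
  const now = Date.now();
  const usedMs = (cpuNow.user - lastCpu.user + cpuNow.system - lastCpu.system) / 1000;
  processCpuUsage.set((usedMs * 100) / (now - lastTime));
  lastCpu = cpuNow;
  lastTime = now;
}, 10000).unref();  // every 10 seconds; unref() so the timer alone doesn't keep the process alive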

Targets:

CPU usage: <70%
Memory usage: <80%
Disk I/O: <80%
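One pitfall when turning process.cpuUsage() into a percentage for the CPU target: Node.js returns cumulative user and system CPU time in microseconds since the process started, so utilization has to come from the difference between two readings divided by the wall-clock time in between. The sketch below shows that calculation with hypothetical helper names (cpuPercent, sampleCpuPercent). Because the value covers all threads of the process, including libuv worker threads, it can exceed 100% on a multi-core machine.

```javascript
// CPU utilization (%) between two process.cpuUsage() readings
function cpuPercent(prevUsage, currUsage, elapsedMs) {
  const usedMicros =
    (currUsage.user - prevUsage.user) + (currUsage.system - prevUsage.system);
  const usedMs = usedMicros / 1000; // microseconds -> milliseconds
  return (usedMs * 100) / elapsedMs;
}

// Take two readings intervalMs apart and report the percentage via callback
function sampleCpuPercent(intervalMs, callback) {
  const startUsage = process.cpuUsage();
  const startTime = process.hrtime.bigint();
  setTimeout(() => {
    const elapsedMs = Number(process.hrtime.bigint() - startTime) / 1e6;
    callback(cpuPercent(startUsage, process.cpuUsage(), elapsedMs));
  }, intervalMs);
}

// Usage: 700ms of CPU time over a 10s interval is 7%
console.log(cpuPercent({ user: 0, system: 0 }, { user: 700000, system: 0 }, 10000));
sampleCpuPercent(1000, (pct) => console.log(`CPU: ${pct.toFixed(1)}%`));
```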

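📈 Prometheus Integration

What is Prometheus?

Prometheus is an open-source monitoring and alerting system. It includes a built-in time-series database, collects metrics by periodically scraping (pulling) an HTTP endpoint such as /metrics on each service, and lets you query them with PromQL.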
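Because Prometheus pulls metrics, each service only has to answer an HTTP GET on /metrics with a plain-text body. prom-client generates that body for you; the minimal sketch below just illustrates what the text exposition format looks like for a counter, including the escaping of backslashes, double quotes, and newlines in label values. renderCounter and escapeLabelValue are hypothetical helpers, not prom-client APIs.

```javascript
// Escape a label value as required by the Prometheus text exposition format
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\') // backslash first, so later escapes are not doubled
    .replace(/\n/g, '\\n')
    .replace(/"/g, '\\"');
}

// Render one counter with its samples as the text a /metrics endpoint returns
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelStr = Object.entries(labels)
      .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
      .join(',');
    lines.push(labelStr ? `${name}{${labelStr}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

console.log(renderCounter('license_validations_total', 'Total number of license validations', [
  { labels: { status: 'success' }, value: 42 },
  { labels: { status: 'failure' }, value: 3 }
]));
```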

Integrating Prometheus into the Auth Service

// auth-service/src/standalone.js
// (app and authService are assumed to be set up elsewhere in this file)
const promClient = require('prom-client');

// Collect default metrics (CPU, memory, event loop, etc.).
// Current prom-client versions gather these at scrape time, so the old timeout option is not needed.
promClient.collectDefaultMetrics({
  prefix: 'auth_service_'
});

// Custom metric definitions
// Buckets include boundaries around the latency targets (25ms, 50ms, 100ms)
// so that histogram_quantile can resolve them
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

const licenseActivations = new promClient.Counter({
  name: 'license_activations_total',
  help: 'Total number of license activations',
  labelNames: ['plan', 'status']
});

const licenseValidations = new promClient.Counter({
  name: 'license_validations_total',
  help: 'Total number of license validations',
  labelNames: ['status']
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

// Record license activation outcomes.
// Request latency is already observed by the latency middleware, so it is not
// recorded again here (doing both would count each request twice).
app.post('/activate', async (req, res) => {
  const { email, password, clientId } = req.body;  // requires app.use(express.json())

  try {
    const result = await authService.activate(email, password, clientId);
    licenseActivations.inc({ plan: result.license.plan, status: 'success' });
    res.status(200).json(result);
  } catch (error) {
    licenseActivations.inc({ plan: 'unknown', status: 'failure' });
    res.status(401).json({ error: error.message });
  }
});
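The buckets option is easy to get wrong, so it helps to see how a histogram records an observation. Each bucket counts the observations less than or equal to its upper bound (le), which makes the buckets cumulative, and an implicit +Inf bucket equals the total count. The simplified class below is a hypothetical illustration, not prom-client's implementation. It also shows why the bucket list needs boundaries near the latency targets: with no boundary between 10ms and 100ms, a 25ms request and a 90ms request land in the same bucket and become indistinguishable.

```javascript
// Simplified Prometheus-style histogram with cumulative buckets
class SimpleHistogram {
  constructor(buckets) {
    this.bounds = [...buckets].sort((a, b) => a - b);
    this.counts = new Array(this.bounds.length).fill(0);
    this.sum = 0;
    this.count = 0; // same as the implicit le="+Inf" bucket
  }

  observe(seconds) {
    // Every bucket whose upper bound is >= the value is incremented
    for (let i = 0; i < this.bounds.length; i++) {
      if (seconds <= this.bounds[i]) this.counts[i]++;
    }
    this.sum += seconds;
    this.count++;
  }

  snapshot() {
    const buckets = this.bounds.map((le, i) => ({ le, count: this.counts[i] }));
    buckets.push({ le: Infinity, count: this.count });
    return { buckets, sum: this.sum, count: this.count };
  }
}

// Observations in seconds, using boundaries around the latency targets
const hist = new SimpleHistogram([0.01, 0.025, 0.05, 0.1]);
[0.004, 0.02, 0.03, 0.2].forEach((s) => hist.observe(s));
console.log(hist.snapshot().buckets);
```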

Prometheus configuration file

# prometheus.yml
global:
  scrape_interval: 15s      # collect metrics every 15 seconds
  evaluation_interval: 15s   # evaluate rules every 15 seconds

scrape_configs:
  # API Gateway
  - job_name: 'api-gateway'
    static_configs:
      - targets: ['api-gateway:3000']
    metrics_path: '/metrics'

  # Auth Service
  - job_name: 'auth-service'
    static_configs:
      - targets: ['auth-service:3001']
    metrics_path: '/metrics'

  # Admin Service
  - job_name: 'admin-service'
    static_configs:
      - targets: ['admin-service:3002']
    metrics_path: '/metrics'

# Alert rules
rule_files:
  - '/etc/prometheus/alerts.yml'

# Alertmanager settings
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Alert rules

# alerts.yml
groups:
  - name: license_system_alerts
    interval: 30s
    rules:
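      # High error rate: share of 4xx/5xx responses per job above 5%
      - alert: HighErrorRate
        expr: sum by (job) (rate(http_errors_total[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05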
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

      # High response time (p99 per job)
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "p99 latency is {{ $value }}s over the last 5 minutes"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service {{ $labels.job }} has been down for more than 1 minute"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (process_memory_usage_bytes / process_memory_limit_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is {{ $value | humanizePercentage }} of the limit"
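The for: clause keeps a brief spike from paging anyone: when expr first returns a result the alert becomes "pending", and it only becomes "firing" (and is sent to Alertmanager) if the condition still holds at every evaluation for the full duration. The sketch below models that state machine in a simplified form (it ignores features such as keep_firing_for); createAlertState is a hypothetical name.

```javascript
// Simplified model of a Prometheus alerting rule with a `for:` duration.
// Call evaluate() once per rule evaluation with whether `expr` returned a result.
function createAlertState(forMs) {
  let activeSince = null; // time the condition first became true

  return function evaluate(conditionTrue, nowMs) {
    if (!conditionTrue) {
      activeSince = null; // condition cleared: the alert goes back to inactive
      return 'inactive';
    }
    if (activeSince === null) activeSince = nowMs;
    // Fires only after the condition has held for the whole `for` duration
    return nowMs - activeSince >= forMs ? 'firing' : 'pending';
  };
}

// HighErrorRate uses `for: 5m` and its group is evaluated every 30s
const evaluate = createAlertState(5 * 60 * 1000);
for (let t = 0; t <= 300000; t += 30000) {
  // 'pending' for t < 300 s, 'firing' at t = 300 s
  console.log(`${t / 1000}s: ${evaluate(true, t)}`);
}
```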

📊 Grafana Dashboard

What is Grafana?

Grafana is a dashboard tool that visualizes metrics from data sources such as Prometheus.

Example dashboard configuration

{
  "dashboard": {
    "title": "Hybrid License System - Overview",
    "panels": [
      {
        "title": "Request Rate (RPS)",
        "targets": [
          {
            "expr": "rate(http_requests_total[1m])",
            "legendFormat": "{{job}} - {{method}} {{route}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time (p95, p99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p99"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum by (job) (rate(http_errors_total[5m])) / sum by (job) (rate(http_requests_total[5m]))",
            "legendFormat": "{{job}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "License Activations",
        "targets": [
          {
            "expr": "sum by (plan, status) (increase(license_activations_total[1h]))",
            "legendFormat": "{{plan}} - {{status}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "process_memory_usage_bytes",
            "legendFormat": "{{job}}"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
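The p95/p99 panels and the HighLatency alert rely on histogram_quantile, which estimates a quantile from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it, assuming observations are spread evenly; if the rank falls in the +Inf bucket it returns the highest finite upper bound. This is also why the le label must be kept whenever series are aggregated before the quantile is computed. The sketch below reproduces that logic in simplified form; histogramQuantile is a hypothetical name and the bucket counts are made up for illustration.

```javascript
// Estimate quantile q from cumulative buckets [{ le, count }] sorted by le,
// ending with le = Infinity (like Prometheus's histogram_quantile)
function histogramQuantile(q, buckets) {
  if (!(q > 0 && q < 1)) throw new RangeError('q must be between 0 and 1 (exclusive)');
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN; // no observations
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  if (buckets[i].le === Infinity) {
    // Rank falls in the +Inf bucket: return the highest finite upper bound
    return buckets[buckets.length - 2].le;
  }
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const countInBucket = buckets[i].count - countBelow;
  // Linear interpolation inside the bucket, assuming a uniform distribution
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

const sampleBuckets = [
  { le: 0.01, count: 1 },
  { le: 0.025, count: 2 },
  { le: 0.05, count: 3 },
  { le: 0.1, count: 3 },
  { le: Infinity, count: 4 }
];
console.log(histogramQuantile(0.5, sampleBuckets), histogramQuantile(0.99, sampleBuckets));
```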

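📝 Log Aggregation (ELK Stack)

What is the ELK stack?

ELK = Elasticsearch + Logstash + Kibana

Elasticsearch: stores and searches log data
Logstash: collects, transforms, and forwards logs
Kibana: UI for visualizing and analyzing logs

Implementing structured logging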

// auth-service/src/logger.js
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'auth-service',
    version: '1.0.0'
  },
  transports: [
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    new winston.transports.File({
      filename: '/var/log/auth-service/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: '/var/log/auth-service/combined.log'
    })
  ]
});

// Example log output
logger.info('License activation started', {
  email: 'user@example.com',
  clientId: 'client-123',
  plan: 'free'
});

try {
  // ... activation logic ...
} catch (error) {
  logger.error('License activation failed', {
    email: 'user@example.com',
    clientId: 'client-123',
    error: error.message,
    stack: error.stack
  });
}

Logstash configuration

# logstash.conf
input {
  file {
    path => "/var/log/*/combined.log"
    codec => "json"
  }
}

filter {
  if [level] == "error" {
    mutate {
      add_tag => ["error"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "license-system-%{+YYYY.MM.dd}"
  }

  # Send an immediate email alert for auth-service errors
  if "error" in [tags] and [service] == "auth-service" {
    email {
      to => "ops@example.com"
      subject => "Auth Service Error"
      body => "%{message}"
    }
  }
}
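To make the pipeline concrete, the sketch below follows what happens to each line: parse the JSON written by winston, add the error tag when level is error, pick the daily index, and decide whether the email output fires. It is a hypothetical illustration, not Logstash code. One assumption to flag: Logstash formats %{+YYYY.MM.dd} from the event's @timestamp in UTC, and without a date filter @timestamp is the time Logstash read the line; the sketch uses winston's timestamp field instead, as a date filter would.

```javascript
// Mimic the Logstash pipeline above for a single winston JSON log line
function processLogLine(line) {
  const event = JSON.parse(line);
  // filter: tag error-level events
  const tags = event.level === 'error' ? ['error'] : [];
  // output: daily index name, formatted in UTC like %{+YYYY.MM.dd}
  const ts = new Date(event.timestamp);
  const pad = (n) => String(n).padStart(2, '0');
  const index = `license-system-${ts.getUTCFullYear()}.${pad(ts.getUTCMonth() + 1)}.${pad(ts.getUTCDate())}`;
  // output: email only for auth-service errors
  const sendEmail = tags.includes('error') && event.service === 'auth-service';
  return { event: { ...event, tags }, index, sendEmail };
}

const line = JSON.stringify({
  level: 'error',
  message: 'License activation failed',
  service: 'auth-service',
  timestamp: '2025-12-24T23:30:00.000Z'
});
console.log(processLogLine(line));
```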

🚨 Alerting

Alertmanager configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'  # placeholder; keep the real value out of version control

route:
  receiver: 'team-email'
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h

  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty'

    - matchers:
        - severity="warning"
      receiver: 'team-slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'

  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
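Alertmanager walks the route tree for each alert: child routes are checked in order, the first match wins unless continue is set, and alerts that match no child go to the root receiver. Alerts with the same values for the group_by labels are batched into one notification. The simplified sketch below mirrors the routing in this configuration; routeAlert and groupKey are hypothetical helpers, and real Alertmanager also supports regex matchers, nested routes, and the timing settings, all omitted here.

```javascript
// Route tree mirroring the alertmanager.yml above (equality matchers only)
const routeTree = {
  receiver: 'team-email',
  groupBy: ['alertname', 'severity'],
  routes: [
    { matchers: { severity: 'critical' }, receiver: 'pagerduty' },
    { matchers: { severity: 'warning' }, receiver: 'team-slack' }
  ]
};

// First matching child route wins (no `continue`), else the root receiver
function routeAlert(labels, tree = routeTree) {
  for (const route of tree.routes) {
    const matches = Object.entries(route.matchers).every(([k, v]) => labels[k] === v);
    if (matches) return route.receiver;
  }
  return tree.receiver;
}

// Alerts sharing these label values are sent together in one notification
function groupKey(labels, groupBy = routeTree.groupBy) {
  return groupBy.map((name) => `${name}=${labels[name] ?? ''}`).join(',');
}

const alert = { alertname: 'HighLatency', severity: 'warning', job: 'auth-service' };
console.log(routeAlert(alert), groupKey(alert));
```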

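🎓 Operational Best Practices

1. Regular health checks

# Health check every 5 minutes via cron (-s silences output, --max-time prevents hanging)
*/5 * * * * curl -sf --max-time 10 -o /dev/null http://localhost:3000/health || echo "API Gateway is down!" | mail -s "Alert" ops@example.com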

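2. Automated backups

# Back up the database every day at midnight
0 0 * * * docker run --rm -v hybrid_database-data:/data -v /backup:/backup alpine tar czf /backup/db-$(date +\%Y\%m\%d).tar.gz /data

Copying an SQLite database file with tar while the service is writing to it can capture an inconsistent state. For a reliable snapshot, pause writes during the backup or use SQLite's online backup mechanism (for example the sqlite3 CLI .backup command or better-sqlite3's db.backup()).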

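3. Log rotation

# /etc/logrotate.d/license-system
# Rotate the files written by the winston File transports (/var/log/auth-service);
# add a block like this for each service's log directory
/var/log/auth-service/*.log {
  daily
  rotate 30
  compress
  delaycompress
  missingok
  notifempty
  # Copy, then truncate in place, so Node.js keeps writing to the same file without
  # restarting every container (a few lines written in between may be lost)
  copytruncate
}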

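4. Security updates

# Pull updated images and recreate containers weekly (Sundays at 2:00)
# cron does not start in the project directory, so cd there first (/opt/hybrid-license-system is a hypothetical path)
0 2 * * 0 cd /opt/hybrid-license-system && docker-compose pull && docker-compose up -d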

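🎉 Summary of All 25 Days

The finished system

✅ API Gateway (Express.js)
   - Routing and proxying
   - Rate limiting
   - CORS and security headers

✅ Auth Service (Node.js)
   - License activation and validation
   - JWT issuance and verification
   - bcrypt password hashing

✅ Admin Service (Express.js + React)
   - Admin dashboard
   - User and license management
   - Statistics API

✅ Infrastructure
   - Docker Compose configuration
   - Prometheus/Grafana monitoring
   - ELK log aggregation
   - Alerting

Skills we learned

Microservices architecture - design principles and patterns
API Gateway implementation - routing and security with Express.js
Authentication and authorization - implementing JWT and bcrypt
Database design - SQLite + better-sqlite3
Docker/Kubernetes - containerization and orchestration
Testing - integration and E2E tests
CI/CD - automated pipelines
Operations and monitoring - Prometheus/Grafana/ELK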

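🚀 What's Next

Short term (1-2 weeks)

Fix the CORS tests
Optimize the Docker images
Build a CI/CD pipeline

Medium term (1-2 months)

Kubernetes support
Finish the admin dashboard UI
Implement a WebAssembly client

Long term (3-6 months)

Multi-region support
Real-time notifications (WebSocket)
Offline license verification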

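🙏 Closing

Great work making it through all 25 days of the Advent Calendar!

Over these 25 days we walked through a commercial-grade license authentication system built on a microservices architecture, from design and implementation to deployment and operations.

The system was designed with the goal of being good enough for real production use. Please try applying it to your own projects.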

Happy Coding! 🎉

🔗 Related Links

Thank you for reading all 25 days! 🎄
