`--read-only` Errors During RDS Blue/Green Deployment Switchover: What to Investigate and How to Configure

Posted at 2025-09-30

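Introduction

When you run a switchover with an AWS RDS Blue/Green Deployment, a Rails application may raise an error like the following.

Mysql2::Error: The MySQL server is running with the --read-only option so it cannot execute this statement: INSERT INTO `sessions` (`session_id`, `data`, `created_at`, `updated_at`) VALUES (...)

This article focuses on the root cause of the problem, existing TCP connections that stay open across the switchover, and walks through what to investigate and which preventive settings to apply.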

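Root cause: existing TCP connections are retained

1. How a Blue/Green switchover works

In an RDS Blue/Green Deployment, the blue environment is the current production and the green environment is the staging copy that becomes the new production. A switchover moves traffic as follows:

The blue environment (current production) is serving traffic
The green environment (new version) is built in parallel
Switchover runs: the production DNS endpoint is changed to point at green instead of blue
The old blue environment moves to read-only mode
The green environment takes over as the new primary

2. The problem caused by retained TCP connections

The Rails connection pool holds physical TCP connections. After the switchover:

DNS resolution returns the new IP (the green environment)
Existing TCP connections, however, stay attached to the old IP (the old blue environment)
The old blue environment is read-only, so INSERT/UPDATE statements fail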
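
The mismatch between DNS and pooled connections can be reproduced without a database. The following is a minimal Ruby sketch, not Rails code: `FakeResolver`, `FakeConnection`, and `FakePool` are hypothetical stand-ins, and the IP addresses are the illustrative ones used in this article.

```ruby
# Toy model: a pooled connection remembers the IP it resolved at connect time.
FakeConnection = Struct.new(:peer_ip)

# Hypothetical resolver whose answer changes at switchover.
class FakeResolver
  attr_accessor :current_ip

  def initialize(ip)
    @current_ip = ip
  end

  def resolve(_hostname)
    current_ip
  end
end

# Hypothetical pool: resolves DNS only when it opens a new connection.
class FakePool
  attr_reader :connections

  def initialize(resolver, hostname)
    @resolver = resolver
    @hostname = hostname
    @connections = []
  end

  def open_connection
    conn = FakeConnection.new(@resolver.resolve(@hostname))
    @connections << conn
    conn
  end

  # Connections whose peer no longer matches what DNS returns now.
  def stale_connections
    now = @resolver.resolve(@hostname)
    @connections.reject { |c| c.peer_ip == now }
  end
end

resolver = FakeResolver.new("10.0.1.100")
pool = FakePool.new(resolver, "your-rds-endpoint.amazonaws.com")
2.times { pool.open_connection }   # opened before the switchover

resolver.current_ip = "10.0.1.200" # switchover: DNS now points elsewhere
pool.open_connection               # only new connections see the new IP

# The two connections opened earlier are still attached to 10.0.1.100.
puts pool.stale_connections.size
```

Nothing in the model ever re-resolves DNS for an open connection, which is the property that matters here: a connection is only "moved" by closing it and opening a new one.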

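What to investigate

1. Network and infrastructure configuration

Relationship between DNS resolution and TCP connections

# Check DNS resolution before and after the switchover
dig +short your-rds-endpoint.amazonaws.com

# Check existing TCP connections
netstat -an | grep :3306
# 10.0.1.100:3306 ESTABLISHED ← old IP (pre-switchover primary, now read-only)
# 10.0.1.200:3306 ESTABLISHED ← new IP (new primary)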
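
The comparison between the `dig` result and the `netstat` output can be scripted. The sketch below is a rough illustration in Ruby: `established_peers` and `stale_peers` are hypothetical helper names, the parser assumes Linux-style `netstat -an` lines where the remote address appears as `ip:port`, and the sample lines are made up for the example.

```ruby
# Extract remote IPv4 addresses of ESTABLISHED connections to the given port
# from `netstat -an`-style output.
def established_peers(netstat_output, port: 3306)
  netstat_output.each_line.filter_map do |line|
    next unless line.include?("ESTABLISHED")
    match = line.match(/(\d{1,3}(?:\.\d{1,3}){3}):#{port}\b/)
    match && match[1]
  end
end

# Peers that are not among the addresses DNS currently returns.
def stale_peers(peers, resolved_ips)
  peers.uniq - resolved_ips
end

# Hypothetical sample input for illustration.
sample = <<~OUT
  tcp  0  0 10.0.0.5:51234  10.0.1.100:3306  ESTABLISHED
  tcp  0  0 10.0.0.5:51240  10.0.1.200:3306  ESTABLISHED
  tcp  0  0 10.0.0.5:22     10.0.9.9:40000   ESTABLISHED
OUT

p stale_peers(established_peers(sample), ["10.0.1.200"])
```

On a real host, `resolved_ips` could come from `Resolv.getaddresses("your-rds-endpoint.amazonaws.com")` in Ruby's standard `resolv` library, and the input from running `netstat -an`.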

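Effect of AZ placement

# Check the AZ of the EC2 instance
# (IMDSv1-style request; if IMDSv2 is enforced, a session token header is required)
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone

# Check where the RDS instance is placed
aws rds describe-db-instances --db-instance-identifier your-db \
  --query 'DBInstances[0].{AZ:AvailabilityZone,MultiAZ:MultiAZ}'

Points to check:

Whether EC2 and RDS are in the same AZ or in different AZs
Whether cross-AZ connections are less likely to be torn down (treat this as a hypothesis to verify, not a documented guarantee)
Whether same-AZ connections detect the failure sooner (likewise something to verify in your environment)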

2. Application-level investigation

Check the connection pool state

# Run in the Rails console
ActiveRecord::Base.connection_pool.stat
# => {:size=>5, :connections=>3, :busy=>1, :dead=>2, :idle=>0, :waiting=>0}
#    Important: check for dead connections (checked out by a thread that is no longer alive)

# Check which server the connection is actually attached to
ActiveRecord::Base.connection.execute("SELECT @@hostname, @@read_only").to_a

Check the Reaper settings

# Run in the Rails console
ActiveRecord::Base.connection_pool.reaper.frequency
# => 60.0 (when configured)
# => nil  (reaper disabled or, on older Rails versions, not configured)

ActiveRecord::Base.connection_pool.reaper
# => confirm that a Reaper object exists

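3. Effect of the web server configuration

Investigating a Passenger environment

# Check the process count and uptime
passenger-status

# The longer a process has been running, the more likely it holds old connections
# Version : 6.0.12
# Processes     : 2
#   * PID: 2712    Uptime: 9h 23m 29s  ← long-running
#   * PID: 7648    Uptime: 8h 15m 22s

Investigating partial failures behind a load balancer

# Check the health of the ALB target group
aws elbv2 describe-target-health --target-group-arn your-target-group-arn

# Check whether the error occurs on only some of the instances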

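Preventive settings

1. Tuning the connection pool

Aggressive connection reaping in database.yml

production:
  adapter: mysql2
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  timeout: 5000
  checkout_timeout: 5

  # Reap connections more aggressively
  reaping_frequency: 30           # shortened from the default of 60 seconds
  idle_timeout: 60                # close idle connections sooner (default is 300 seconds)

  # Connection monitoring at the TCP/MySQL level
  variables:
    wait_timeout: 120             # server-side timeout for idle connections
    interactive_timeout: 120      # timeout for interactive connections

  # Connection health
  reconnect: true                 # enable automatic reconnection

Key points:

reaping_frequency: shortens the interval at which the reaper runs
idle_timeout: closes connections that have sat idle in the pool; a connection that is checked out again before the timeout elapses never counts as idle and is not affected
MySQL variables: let the server close idle connections from its side as well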
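
To make the limits of idle-based reaping concrete, here is a simplified Ruby model of how such a flush selects connections. It is a sketch under stated assumptions, not Rails' implementation: `PooledConn` and `flush_idle` are hypothetical names, and times are plain numbers of seconds.

```ruby
# Toy model of idle-based reaping.
PooledConn = Struct.new(:name, :in_use, :idle_since) do
  # A checked-out connection is never considered idle.
  def seconds_idle(now)
    in_use ? 0 : now - idle_since
  end
end

# Return [kept, removed] for a given idle_timeout.
def flush_idle(conns, now:, idle_timeout:)
  removed = conns.select { |c| !c.in_use && c.seconds_idle(now) >= idle_timeout }
  [conns - removed, removed]
end

now = 1_000
conns = [
  PooledConn.new("rarely_used", false, now - 90), # idle for 90 s
  PooledConn.new("hot",         false, now - 2),  # checked in 2 s ago
  PooledConn.new("busy",        true,  nil)       # currently checked out
]

kept, removed = flush_idle(conns, now: now, idle_timeout: 60)
puts "removed: #{removed.map(&:name).inspect}"
puts "kept:    #{kept.map(&:name).inspect}"
```

Only the rarely used connection is removed. On a busy application server most connections look like `hot` or `busy`, so shorter reaping settings alone may not evict connections that still point at the old primary; that is why the monitoring and recovery measures in the following sections are worth combining with them.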

2. Implementing connection health checks

Periodic monitoring of the read-only state

# config/initializers/database_health_monitor.rb
Rails.application.config.after_initialize do
  Thread.new do
    loop do
      begin
        # with_connection returns the connection to the pool after each check
        read_only = ActiveRecord::Base.connection_pool.with_connection do |conn|
          # select_one returns a Hash; rows from execute are arrays under mysql2
          conn.select_one("SHOW VARIABLES LIKE 'read_only'")['Value']
        end

        if read_only == 'ON'
          Rails.logger.error "Database read-only detected! Clearing connections..."
          ActiveRecord::Base.clear_all_connections!

          # Send an external alert if needed
          # SlackNotifier.alert("Database read-only detected")
        end
      rescue => e
        Rails.logger.error "Database health check failed: #{e.message}"
        ActiveRecord::Base.clear_all_connections!
      end

      sleep 30  # check interval
    end
  end if Rails.env.production?
end

Automating Blue/Green switchover detection

# config/initializers/blue_green_monitor.rb
class BlueGreenSwitchDetector
  def self.start_monitoring
    Thread.new do
      last_hostname = nil

      loop do
        begin
          # An open connection keeps reporting the host it is attached to, so a
          # change only shows up once the pool hands out a connection that was
          # opened after the switchover.
          current_hostname = ActiveRecord::Base.connection_pool.with_connection do |conn|
            conn.select_value("SELECT @@hostname")
          end

          if last_hostname && last_hostname != current_hostname
            Rails.logger.info "Blue/Green switch detected: #{last_hostname} -> #{current_hostname}"
            Rails.logger.info "Clearing connection pool proactively"
            ActiveRecord::Base.clear_all_connections!
          end

          last_hostname = current_hostname
        rescue => e
          Rails.logger.error "Blue/Green monitoring error: #{e.message}"
        end

        sleep 30
      end
    end
  end
end

# Start monitoring when the application boots
Rails.application.config.after_initialize do
  BlueGreenSwitchDetector.start_monitoring if Rails.env.production?
end
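
The change-detection rule used by the monitor can be separated from the thread and the database so that it is easy to unit-test. The following is a small sketch in plain Ruby; `HostnameChangeDetector` is a hypothetical name and the hostnames are made-up examples.

```ruby
# Remembers the last observed hostname and reports when it changes.
class HostnameChangeDetector
  def initialize
    @last = nil
  end

  # Returns [old, new] when the hostname differs from the previous
  # observation, otherwise nil. The first observation is never a change.
  def observe(current)
    previous = @last
    @last = current
    [previous, current] if previous && previous != current
  end
end

detector = HostnameChangeDetector.new
detector.observe("ip-10-0-1-100")          # first observation, no change
detector.observe("ip-10-0-1-100")          # same host, no change
change = detector.observe("ip-10-0-1-200") # host differs from the last one
puts "switch detected: #{change.join(' -> ')}" if change
```

In the monitoring thread, the value passed to `observe` would be the result of the `SELECT @@hostname` query, and a non-nil return value would trigger the pool reset.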

3. Automatic recovery when the error occurs

Watching SQL execution errors and reacting automatically

# config/initializers/database_error_recovery.rb
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  if event.payload[:exception_object]&.message&.include?('read-only')
    Rails.logger.error "READ-ONLY ERROR: Automatic recovery initiated"

    begin
      ActiveRecord::Base.clear_all_connections!
      Rails.logger.info "Connection pool cleared successfully"

      # Verify on a new connection that the server accepts writes
      # (SELECT 1 alone would also succeed on a read-only server)
      read_only = ActiveRecord::Base.connection.select_value("SELECT @@read_only").to_i
      if read_only.zero?
        Rails.logger.info "New connection verified as writable"
      else
        Rails.logger.error "New connection is still attached to a read-only server"
      end

    rescue => recovery_error
      Rails.logger.error "Auto-recovery failed: #{recovery_error.message}"
      # Notify an external monitoring system, etc.
    end
  end
end
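
Clearing the pool helps later requests, but the statement that hit the error has already failed. One possible complement is a reset-and-retry wrapper around individual writes. This is a sketch in plain Ruby, not part of Rails: `with_read_only_retry` is a hypothetical helper, and `reset` is any callable, for example `-> { ActiveRecord::Base.clear_all_connections! }` in a Rails application.

```ruby
# Matches the message MySQL returns when the server is read-only.
READ_ONLY_PATTERN = /--read-only option/

# Run the block; if it fails with a read-only error, call `reset` and retry,
# at most `max_retries` times. Any other error is re-raised unchanged.
def with_read_only_retry(reset:, max_retries: 1)
  attempts = 0
  begin
    yield
  rescue StandardError => e
    raise unless e.message.match?(READ_ONLY_PATTERN) && attempts < max_retries
    attempts += 1
    reset.call
    retry
  end
end

# Usage with a stand-in for a write that fails once and then succeeds.
calls = 0
result = with_read_only_retry(reset: -> { puts "resetting connections" }) do
  calls += 1
  if calls == 1
    raise "The MySQL server is running with the --read-only option so it cannot execute this statement"
  end
  :written
end
puts result
```

Retrying a write is only safe when the statement is not part of a transaction that was already partly applied, so a wrapper like this fits simple, self-contained writes such as the session insert from the error above.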

4. Operational measures

Implementing a health check endpoint

# config/routes.rb
Rails.application.routes.draw do
  get '/health/database', to: 'health#database'
end

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def database
    conn = ActiveRecord::Base.connection
    # select_value returns a scalar; @@read_only is 1 on a read-only server
    read_only = conn.select_value("SELECT @@read_only").to_i == 1
    hostname = conn.select_value("SELECT @@hostname")

    # Answer 503 when read-only so that `curl -f` and load balancer checks fail
    render json: {
      status: read_only ? 'read_only_error' : 'healthy',
      hostname: hostname,
      pool_stat: ActiveRecord::Base.connection_pool.stat,
      read_only: read_only,
      timestamp: Time.current
    }, status: (read_only ? :service_unavailable : :ok)
  rescue => e
    render json: {
      status: 'error',
      error: e.message,
      timestamp: Time.current
    }, status: 500
  end
end
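
The decision the endpoint makes can be kept in a small pure function, which makes it testable without a database. A minimal sketch, assuming the endpoint should answer 503 while the server is read-only; `database_health` is a hypothetical helper name.

```ruby
# Map the probe results to a response body and an HTTP status code.
def database_health(read_only:, hostname:)
  if read_only
    [{ status: "read_only_error", hostname: hostname, read_only: true }, 503]
  else
    [{ status: "healthy", hostname: hostname, read_only: false }, 200]
  end
end

body, code = database_health(read_only: true, hostname: "ip-10-0-1-100")
puts "#{code} #{body[:status]}"
```

Returning a non-2xx code in the read-only case is a design choice: it lets tools that only look at the HTTP status, such as `curl -f` or an ALB health check, notice the problem without parsing the JSON body.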

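Standardizing the Blue/Green switchover procedure

# Checklist for running a switchover

# 1. Preparation
echo "=== Pre-Switch Checks ==="
curl -f http://production-app/health/database
passenger-status  # check process state

# 2. Run the switchover
# Run the switchover from the RDS console or the AWS CLI

# 3. Post-switchover monitoring (5 minutes)
echo "=== Post-Switch Monitoring ==="
for i in {1..30}; do
  echo "Check $i/30:"
  curl -s http://production-app/health/database | jq '.status'
  sleep 10
done

# 4. Automatic remediation when a problem is detected
# Compare the reported status, since the endpoint may answer HTTP 200 even when read-only
status=$(curl -s http://production-app/health/database | jq -r '.status')
if [ "$status" != "healthy" ]; then
  echo "Health check failed - restarting application..."
  # --rolling-restart requires Passenger Enterprise; drop the flag otherwise
  passenger-config restart-app /var/www/hoge/production/current --rolling-restart
fi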
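
The post-switchover monitoring step can also be written as a reusable check that requires several consecutive healthy responses before declaring success. The sketch below is in Ruby; `wait_until_healthy` is a hypothetical helper, `fetch` is any callable that returns the JSON body as a string, and the defaults mirror the 30 checks at 10-second intervals used above.

```ruby
require "json"

# Poll a health probe until it reports "healthy" `required_streak` times in a
# row. `sleeper` is injectable so the loop can be exercised without waiting.
def wait_until_healthy(fetch:, checks: 30, required_streak: 3, interval: 10, sleeper: method(:sleep))
  streak = 0
  checks.times do
    status = begin
      JSON.parse(fetch.call)["status"]
    rescue StandardError
      "error" # unreachable endpoint or invalid JSON counts as unhealthy
    end
    streak = status == "healthy" ? streak + 1 : 0
    return true if streak >= required_streak
    sleeper.call(interval)
  end
  false
end

# Usage with canned responses instead of a real HTTP call.
responses = ['{"status":"read_only_error"}'] + ['{"status":"healthy"}'] * 3
ok = wait_until_healthy(fetch: -> { responses.shift }, sleeper: ->(_seconds) {})
puts ok ? "stable" : "still unhealthy"
```

Against a real application, `fetch` could be `-> { Net::HTTP.get(URI("http://production-app/health/database")) }` after `require "net/http"`. Requiring a streak rather than a single success guards against a response that happened to be served by a process holding a fresh connection while others still hold stale ones.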

Troubleshooting guide

1. Investigation steps when the problem occurs

Step 1: Check the current connection state

# Run in the Rails console
puts "=== Connection Pool Status ==="
pp ActiveRecord::Base.connection_pool.stat

puts "=== Current Database State ==="
begin
  conn = ActiveRecord::Base.connection
  hostname = conn.select_value("SELECT @@hostname")
  # select_one returns a Hash; rows from execute are arrays under mysql2
  read_only = conn.select_one("SHOW VARIABLES LIKE 'read_only'")['Value']
  puts "Hostname: #{hostname}"
  puts "Read Only: #{read_only}"
rescue => e
  puts "Connection Error: #{e.message}"
end

Step 2: Network-level investigation

# Check DNS resolution
echo "=== DNS Resolution ==="
dig +short your-rds-endpoint.amazonaws.com
nslookup your-rds-endpoint.amazonaws.com

# Check TCP connections
echo "=== Active TCP Connections ==="
netstat -an | grep :3306 | head -10

# Check process information
echo "=== Application Processes ==="
passenger-status --verbose

Step 3: Log analysis

# Look for error patterns in the Rails log
echo "=== Recent Database Errors ==="
grep -i "read.only\|mysql.*option" log/production.log | tail -10

# Correlate by timestamp (entries from the current hour)
grep "$(date '+%Y-%m-%d %H:')" log/production.log | grep -i "database\|mysql"
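
For the timestamp correlation, counting read-only errors per minute makes it easier to line the errors up with the switchover time. A small Ruby sketch follows; `read_only_errors_per_minute` is a hypothetical helper, and it assumes each log line contains a timestamp of the form `2025-09-30T12:34:56` or `2025-09-30 12:34:56`, so the regex may need adjusting for your log formatter.

```ruby
# Count read-only errors per minute from an enumerable of log lines.
def read_only_errors_per_minute(lines)
  counts = Hash.new(0)
  lines.each do |line|
    next unless line.include?("--read-only option")
    minute = line[/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}/] # truncate to the minute
    counts[minute] += 1 if minute
  end
  counts
end

# Usage against the production log:
#   File.open("log/production.log") do |f|
#     read_only_errors_per_minute(f.each_line).each { |minute, n| puts "#{minute} #{n}" }
#   end
```

A burst that starts at the switchover minute and continues until the processes are restarted points at retained connections rather than at a short DNS propagation delay.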

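2. How to pin down the root cause

Detailed TCP connection investigation

# Per-process connections with lsof
# (-d, joins the PIDs with commas as lsof -p expects; -nP keeps addresses and ports numeric)
sudo lsof -nP -a -p "$(pgrep -d, -if passenger)" -iTCP:3306

# More detail with ss (established client connections to port 3306, with timer info)
ss -tno state established '( dport = :3306 )'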

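Comparing differences between environments

# Compare Production and Staging settings
# (rails runner is used because rails console has no option to run a one-off command;
#  note that the output includes credentials)
echo "=== Database Configuration Comparison ==="
# Production
rails runner -e production "pp Rails.application.config.database_configuration['production']"

# Staging
rails runner -e staging "pp Rails.application.config.database_configuration['staging']"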

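Summary

The read-only errors seen at RDS Blue/Green Deployment switchover are caused by existing TCP connections that stay attached to the old primary.

Key points when investigating

Understand the network configuration
- AZ placement of EC2 relative to RDS
- DNS resolution and TCP connection state are independent of each other
- Partial-failure patterns behind a load balancer
Monitor the connection pool state
- Whether the Reaper settings are actually in effect
- Connections held by long-running processes
- Accumulation of dead connections
Application load characteristics
- Differences in connection usage patterns between environments
- Connection pool state at the moment of switchover

Preventive configuration policy

More aggressive connection reaping:

Shorten reaping_frequency (60 → 30 seconds)
Shorten idle_timeout so that connections are not kept for long periods
Set connection timeouts through MySQL variables

Automatic monitoring and recovery:

Periodic monitoring of the read-only state
Detection of the Blue/Green switchover
Automatic clearing of the connection pool when the error occurs

Operational measures:

Implement a health check endpoint
Standardize the switchover procedure
Define a staged (rolling) application restart procedure

With these settings in place, the application is meant to recover automatically without manual intervention, which lets you get the full benefit of Blue/Green Deployments.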
