
Ansible Advent Calendar 2015

How I Got Blue-Green Deployment Working with Ansible and AWS Auto Scaling

Last updated at 2015-12-17, posted at 2015-12-17

version
ansible 1.9.2

Requirements

・Achieve immutable infrastructure
・Deploy safely, with zero downtime and no lost logs
・Make use of Auto Scaling, which delivers the most performance for the least cost

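Preparation

Create an Auto Scaling group and have it running (reference: Getting Started with Auto Scaling)
Set up the SNS topic and IAM role for the Auto Scaling lifecycle hook (reference: [New Feature] Adding initialization/cleanup processing at Auto Scaling instance launch/termination: Introducing the LifeCycleHook feature)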

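Deployment flow

Receive an instance_id and create an AMI from that instance
Create a LaunchConfiguration from the new AMI
Update the AutoScalingGroup settings
Bring servers of the new version into service on the ELB (the same number as the old version)
Right after the new servers are in service, take the old ones out of service on the ELB
Terminate the old servers only after they have finished shipping their logs

A playbook that performs a Blue-Green deploy following the flow above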

blue-green-deploy.yml

- hosts: localhost
  connection: local
  gather_facts: no
  vars:
    - aws:
        access_key: AAAAAAAAAA
        secret_key: BBBBBBBBBB
    - ec2:
        instance_type: t2.medium
        security_group_id: sg-xxxxxxx
        vpc_subnet_id_1a: subnet-aaaaaaaa
        vpc_subnet_id_1c: subnet-cccccccc
        availability_zone_1a: ap-northeast-1a
        availability_zone_1c: ap-northeast-1c
        region: ap-northeast-1
        associate_elb: elb-sample
    - heartbeat_timeout: 300
    - lifecycle_notify: arn:aws:sns:ap-northeast-1:xxxxx:hogehoge
    - lifecycle_role: arn:aws:iam::xxxxxxx:role/sample-asg-lifecycle
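    - auto_scaling_group_name: asg_sample
    # Referenced by later tasks; matches the name given to the LaunchConfiguration below
    - launch_config_name: "lc_{{ ec2_instance_id }}"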
  tasks:
    - name: Create the AMI
      ec2_ami:
        aws_access_key: "{{ aws.access_key }}"
        aws_secret_key: "{{ aws.secret_key }}"
        region: "{{ ec2.region }}"
        instance_id: "{{ ec2_instance_id }}"
        no_reboot: no
        wait: yes
      register: ami_result

    - name: Create the LaunchConfiguration
      ec2_lc:
        name: "lc_{{ ec2_instance_id }}"
        image_id: "{{ ami_result.image_id }}"
        region: "{{ ec2.region }}"
        security_groups: "{{ ec2.security_group_id }}"
        instance_type: "{{ ec2.instance_type }}"

    - name: Fetch the current Auto Scaling group settings
      shell: |
        aws autoscaling describe-auto-scaling-groups \
          --auto-scaling-group-names {{ auto_scaling_group_name }} \
        | jq '.AutoScalingGroups[0] | {MinSize: .MinSize, MaxSize: .MaxSize, DesiredCapacity: .DesiredCapacity, Instances: .Instances}'
      register: auto_scaling_group_desc

    - name: Store the current Auto Scaling group settings in variables
      set_fact:
        asg_desc: "{{ auto_scaling_group_desc.stdout|from_json }}"
      register: result
    - set_fact:
        asg_min_size: "{{ asg_desc.MinSize|int }}"
        asg_max_size: "{{ asg_desc.MaxSize|int }}"
        asg_desired_capacity: "{{ asg_desc.DesiredCapacity|int }}"
        asg_instances: "{{ asg_desc.Instances }}"

    - name: Register a lifecycle hook so that terminating instances shut down safely
      shell: |
        aws autoscaling put-lifecycle-hook \
          --lifecycle-hook-name {{ auto_scaling_group_name }}-terminate \
          --auto-scaling-group-name {{ auto_scaling_group_name }} \
          --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
          --notification-target-arn {{ lifecycle_notify }} \
          --role-arn {{ lifecycle_role }} \
          --heartbeat-timeout {{ heartbeat_timeout }}

    - name: Update the Auto Scaling group settings
      ec2_asg:
        name: "{{ auto_scaling_group_name }}"
        health_check_period: 300
        load_balancers: [ "{{ ec2.associate_elb }}" ]
        health_check_type: ELB
        availability_zones: [ "{{ ec2.availability_zone_1a }}", "{{ ec2.availability_zone_1c }}" ]
        launch_config_name: "{{ launch_config_name }}"
        min_size: "{{ asg_min_size }}"
        max_size: "{{ asg_max_size|int * 2 }}"
        desired_capacity: "{{ asg_desired_capacity|int * 2 }}"
        region: "{{ ec2.region }}"
        vpc_zone_identifier: [ "{{ ec2.vpc_subnet_id_1a }}", "{{ ec2.vpc_subnet_id_1c }}" ]
        state: present

    - name: Change the termination policies (so that the oldest instances are terminated first)
      shell: |
        aws autoscaling update-auto-scaling-group \
          --auto-scaling-group-name {{ auto_scaling_group_name }} \
          --termination-policies OldestInstance OldestLaunchConfiguration

    - name: Wait until all instances (old and new) are InService on the ELB
      shell: |
        aws elb describe-instance-health \
          --load-balancer-name {{ ec2.associate_elb }} \
        | jq '[.InstanceStates[] | select(.State == "InService")] | length'
      register: res_elb
      until: res_elb.stdout|int >= (asg_desired_capacity|int * 2)
      retries: 300
      delay: 5

    - name: Count the instances launched from the new LaunchConfiguration
      shell: |
        aws autoscaling describe-auto-scaling-groups \
          --auto-scaling-group-names {{ auto_scaling_group_name }} \
        | jq '.AutoScalingGroups[].Instances' \
        | jq 'map(select(.LaunchConfigurationName == "{{ launch_config_name }}")) | length'
      register: new_instance_count

    - name: Verify that the new instances are all in place (abort if something went wrong)
      fail:
        msg: "Deploy aborted: the new instances did not all come into service"
      when: res_elb.attempts >= 300 or asg_desired_capacity|int > new_instance_count.stdout|int

    - name: Terminate the old instances
      shell: |
        aws autoscaling terminate-instance-in-auto-scaling-group \
          --instance-id {{ item.InstanceId }} \
          --should-decrement-desired-capacity
      when: item.LifecycleState == 'InService'
      with_items: "{{ asg_instances }}"

    - name: Restore the original Auto Scaling settings
      shell: |
        aws autoscaling update-auto-scaling-group \
          --auto-scaling-group-name {{ auto_scaling_group_name }} \
          --max-size {{ asg_max_size }}

＿人人人人人人人人人人＿
＞　Mostly shell... orz　＜
￣ＹＹＹＹＹＹＹＹＹＹ￣

Sorry, it is almost all shell...
It is rough, but it works reasonably well, so please bear with me.
(It differs from what I actually ran, but it should work...!)

Running it

Command

# Just pass the instance_id
ansible-playbook blue-green-deploy.yml --extra-vars="ec2_instance_id=i-hogehoge"

Walkthrough

Here is the deploy flow in more detail.

1) Fetch the current Auto Scaling settings

2) Record the current scale of the Auto Scaling group (desired) and the InstanceIds of the servers currently in service

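3) Change the Auto Scaling settings

Changes:
　・Switch launch_config to the new version (so that instances launch with the latest configuration)
　・Set desired to exactly twice its current value. min is left unchanged; the playbook above also doubles max temporarily so that the doubled desired fits, and restores it at the end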
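The size arithmetic in this step can be sketched in Python. This is only an illustration of what the ec2_asg task in the playbook computes, not part of the playbook itself; the function name `scaled_out_settings` is hypothetical.

```python
def scaled_out_settings(min_size, max_size, desired):
    """Return the temporary ASG sizes used while old and new servers coexist.

    Mirrors the ec2_asg task in the playbook: min_size is kept as-is,
    while max_size and desired_capacity are doubled.
    """
    return {
        "min_size": min_size,             # unchanged
        "max_size": max_size * 2,         # raised so the doubled desired fits
        "desired_capacity": desired * 2,  # old fleet + same-sized new fleet
    }


if __name__ == "__main__":
    print(scaled_out_settings(min_size=2, max_size=4, desired=3))
```

Once the old instances have been terminated with --should-decrement-desired-capacity, desired falls back to its original value, which is why the last task only needs to restore max.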

4) Wait until every server, old and new, is in service from the ELB's point of view

Wait until the ELB health checks pass and it is safe to proceed.

- name: Wait until all instances (old and new) are InService on the ELB
  shell: |
    aws elb describe-instance-health \
      --load-balancer-name {{ ec2.associate_elb }} \
    | jq '[.InstanceStates[] | select(.State == "InService")] | length'
  register: res_elb
  until: res_elb.stdout|int >= (asg_desired_capacity|int * 2)
  retries: 300
  delay: 5

Using the AWS CLI, this polls every 5 seconds, up to 300 times, until the number of servers that are InService on the ELB reaches twice the desired capacity recorded earlier.
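For readers who find the until/retries/delay combination hard to follow, here is a minimal Python sketch of the same polling logic. It is not what the playbook runs: `count_in_service` and `wait_until_in_service` are hypothetical names, and `fetch` stands in for whatever returns the JSON text of `aws elb describe-instance-health`.

```python
import json
import time


def count_in_service(describe_output):
    """Count instances whose State is InService in describe-instance-health JSON."""
    states = json.loads(describe_output).get("InstanceStates", [])
    return sum(1 for s in states if s.get("State") == "InService")


def wait_until_in_service(fetch, target, retries=300, delay=5, sleep=time.sleep):
    """Poll fetch() until at least `target` instances are InService.

    Returns the number of attempts used, or None if every retry was exhausted.
    """
    for attempt in range(1, retries + 1):
        if count_in_service(fetch()) >= target:
            return attempt
        if attempt < retries:
            sleep(delay)  # wait before the next poll
    return None


if __name__ == "__main__":
    sample = json.dumps({"InstanceStates": [
        {"InstanceId": "i-aaaa", "State": "InService"},
        {"InstanceId": "i-bbbb", "State": "OutOfService"},
    ]})
    print(count_in_service(sample))
```

Here `target` corresponds to `asg_desired_capacity * 2` in the playbook, and a None result corresponds to the retries running out.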

5) As a precaution, fetch the information needed to check how many of the in-service servers are the new version

Get the number of new-version servers

- name: Count the instances launched from the new LaunchConfiguration
  shell: |
    aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names {{ auto_scaling_group_name }} \
    | jq '.AutoScalingGroups[].Instances' \
    | jq 'map(select(.LaunchConfigurationName == "{{ launch_config_name }}")) | length'
  register: new_instance_count

6) Check that the deploy is safe (abort if it is not)

Check that the deploy went through safely

- name: Verify that the new instances are all in place (abort if something went wrong)
  fail:
    msg: "Deploy aborted: the new instances did not all come into service"
  when: res_elb.attempts >= 300 or asg_desired_capacity|int > new_instance_count.stdout|int

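=> Abort in either of the following two cases (to avoid causing downtime):
　・The check of how many servers are in service on the ELB timed out
　・The number of new-version servers is smaller than the original number of servers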
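The two abort conditions can be written out as a small Python predicate. This is a sketch of the `when:` expression only; `should_abort` and its parameter names are hypothetical, with `max_retries` standing for the 300 retries used in the wait task.

```python
def should_abort(elb_attempts, max_retries, original_desired, new_instance_count):
    """Return True when the deploy must stop before any old instance is terminated."""
    timed_out = elb_attempts >= max_retries              # ELB check used up every retry
    too_few_new = original_desired > new_instance_count  # new fleet smaller than the old one
    return timed_out or too_few_new


if __name__ == "__main__":
    print(should_abort(elb_attempts=12, max_retries=300,
                       original_desired=2, new_instance_count=2))
```

Stopping here is what keeps the old servers alive: nothing has been terminated yet, so the service keeps running on the old version.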

7) Once step 6 passes, move the InstanceIds recorded in step 2 into the terminating state

Terminate the old instances

- name: Terminate the old instances
  shell: |
    aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id {{ item.InstanceId }} \
      --should-decrement-desired-capacity
  when: item.LifecycleState == 'InService'
  with_items: "{{ asg_instances }}"
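The filtering done by with_items plus when can be sketched in Python as follows. The function name `instances_to_terminate` is hypothetical; the input is assumed to be the Instances list recorded in step 2, in the shape returned by describe-auto-scaling-groups.

```python
def instances_to_terminate(recorded_instances):
    """Pick the InstanceIds recorded before the deploy that were InService at that time."""
    return [
        inst["InstanceId"]
        for inst in recorded_instances
        if inst.get("LifecycleState") == "InService"
    ]


if __name__ == "__main__":
    recorded = [
        {"InstanceId": "i-old1", "LifecycleState": "InService"},
        {"InstanceId": "i-old2", "LifecycleState": "Terminating"},
    ]
    print(instances_to_terminate(recorded))
```

Because the list was captured before the group was scaled out, it contains only old-version instances, so the new ones are never candidates for termination.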

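8) The deploy is complete

Old-version servers are not terminated immediately

As long as the lifecycle hook's Terminating:Wait state is configured, instances get a grace period before they are actually terminated. (Use that time to let fluentd and the like ship all remaining logs.)

Registering the lifecycle hook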

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name {{ auto_scaling_group_name }}-terminate \
  --auto-scaling-group-name {{ auto_scaling_group_name }} \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --notification-target-arn {{ lifecycle_notify }} \
  --role-arn {{ lifecycle_role }} \
  --heartbeat-timeout {{ heartbeat_timeout }}

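About ec2_asg's rolling replacement (replace_all_instances)

Ansible's ec2_asg module has an option that switches servers over gradually and automatically as a rolling update at deploy time. (The related replace_instances option takes a list of instance IDs instead.)

ec2_asg:
  replace_all_instances: yes

However, this is slow, and it does not play well with Terminating:Wait.

It waits for each instance's Terminating:Wait state to finish.
(Example: with a 5-minute Terminating:Wait, while one instance is terminating, the next one is not terminated until 5 minutes later. In the meantime, the old-version servers stay attached to the ELB.)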
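To get a feel for how slow this becomes, here is a rough estimate in Python. It rests on assumptions not confirmed in this article: instances are replaced one batch at a time, and each batch sits in Terminating:Wait for the full timeout. The function name is hypothetical.

```python
import math


def rolling_replace_lower_bound_minutes(instance_count, wait_minutes, batch_size=1):
    """Estimate the minimum time spent waiting on Terminating:Wait.

    Assumes one batch is terminated at a time and each batch waits
    the full `wait_minutes` before the next batch starts.
    """
    batches = math.ceil(instance_count / batch_size)
    return batches * wait_minutes


if __name__ == "__main__":
    print(rolling_replace_lower_bound_minutes(instance_count=4, wait_minutes=5))
```

Under these assumptions the wait grows linearly with the number of instances, whereas the playbook above terminates all old instances at once and pays the wait only once.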

Test

It is a simple test, but I used the script below to check that the downtime really is zero.

Test script

watch -n 1 "curl -s https://sample.io/http_status -o /dev/null -w '%{http_code}\n'"

If no servers are properly attached to the ELB, requests return 503.
With ec2_asg's rolling replacement described earlier, a few 503s occurred, but with the playbook above every response was 200.
