How we dealt with the Azure outage of February 26

Last updated at 2021-03-05, posted at 2021-02-26

Introduction

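The Azure Japan East outage ran from 12:29 to 19:02 on February 26 (JST). Losing half a day on an afternoon that was both month-end and the end of the week probably forced a lot of people into overtime.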

According to Microsoft's official announcement, Azure Storage (roughly the equivalent of S3 on AWS) and services that depend on Azure Storage, such as Virtual Machines, saw increased latency or went down entirely.

RCA Pending - Azure Storage and dependent services - Japan East (Tracking ID PLWV-BT0)
In light of new information, the following RCA is still preliminary and our investigations are continuing. While we have not changed any of the text below, we will provide an update once our investigation concludes.

Summary of Impact: Between 03:29 UTC and 10:02 UTC on 26 Feb 2021, a subset of customers in Japan East may have experienced service degradation and increased latency for resources utilizing Azure Storage, including failure of virtual machine disks. Some Azure services utilizing Storage may have also experienced downstream impact.

Root Cause: There were contributing factors that led to customer impact.

Firstly, we had an active deployment in progress on a single storage scale unit. Our safe deployment process normally reserves some resources within a scale unit so that deployments can take place. In addition to this space being reserved for the deployment, some nodes in the scale unit entered an unhealthy state and so they were removed from use from the scale unit. The final factor was that resource demand on the scale unit was unusually high.

In this case, our resource balancing automation was not able to keep up and spread the load to other scale units. A combination of all these factors resulted in a high utilization of this scale unit causing it to be heavily throttled in order to prevent failure. This resulted in a loss of availability for customers and Azure services attempting to utilize Storage resources within the impacted storage scale unit.

Mitigation: To mitigate customer impact as fast as possible, unhealthy nodes were recovered which restored resources to the service. In addition, engineers took steps to aggressively balance resource load out of the storage scale unit.

Once Storage services were recovered around 06:56 UTC, dependent services started recovering. We declared full mitigation at 10:02 UTC.

Next steps: We sincerely apologize for the impact this event had on our customers. Next steps include but are not limited to:

Improve detection and alerting when auto-balancing is not keeping up to help quickly trigger manual mitigation steps.
Reduce the maximum allowed resource utilization levels for smaller storage scale units to help ensure increased resource headroom in the face of multiple unexpected events.

Now that the RCA (root cause analysis) report has been published, I will try to break it down a little.
Note, however, that as of March 5 this RCA is still preliminary and the investigation is ongoing.
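
The RCA gives its timestamps in UTC, while the times felt on the ground in Japan are JST (UTC+9). As a minimal sketch of that conversion, the following Python uses only the three timestamps quoted in the RCA; the labels and the helper names `to_jst` and `format_timeline` are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), "JST")

# Timestamps quoted in the RCA (all UTC, 26 Feb 2021). Labels are illustrative.
RCA_EVENTS_UTC = {
    "impact start": datetime(2021, 2, 26, 3, 29, tzinfo=timezone.utc),
    "storage recovered": datetime(2021, 2, 26, 6, 56, tzinfo=timezone.utc),
    "full mitigation": datetime(2021, 2, 26, 10, 2, tzinfo=timezone.utc),
}


def to_jst(moment):
    """Convert a timezone-aware datetime to JST."""
    return moment.astimezone(JST)


def format_timeline(events):
    """Return one 'label: time JST' line per event, in insertion order."""
    return [f"{label}: {to_jst(ts):%Y-%m-%d %H:%M} JST" for label, ts in events.items()]


if __name__ == "__main__":
    for line in format_timeline(RCA_EVENTS_UTC):
        print(line)
```

Run as a script, this prints 12:29, 15:56 and 19:02 JST, which lines up with the 12:29 to 19:02 window mentioned at the start and with the storage recovery announcement at around 16:00.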

RCA summary

The outage was caused by several factors coinciding.

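The first factor was that a deployment was in progress on a single unit called a storage scale unit. A deployment normally reserves part of the resources in the scale unit beforehand. This time, on top of that reserved space, some nodes happened to be in an unhealthy state and had been removed from the scale unit.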

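The second factor was unusually high resource demand on that scale unit. In such cases resource balancing automatically spreads load to other scale units, but it could not keep up. Utilization of the scale unit climbed, and it was heavily throttled (that is, the number of requests was limited) to prevent failure.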
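
Throttling can be implemented in many ways, and the RCA does not say which mechanism Azure Storage uses. Purely as an illustration of limiting the number of requests, here is a minimal token bucket sketch in Python; the class name, the rate and the capacity are all hypothetical.

```python
class TokenBucket:
    """Illustrative rate limiter: NOT Azure's actual throttling mechanism."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity) # start with a full bucket
        self.last = 0.0               # time of the previous call, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # throttled: the caller should back off and retry


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2, capacity=2)
    # Five requests arrive at the same instant; only the burst capacity passes.
    print([bucket.allow(0.0) for _ in range(5)])
    # One second later the bucket has refilled.
    print(bucket.allow(1.0))
```

From the client's point of view, the rejected requests are what shows up as errors or added latency while the service protects itself.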

As a result, services using Azure Storage placed on the affected scale unit became unavailable.

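As mitigation, the nodes that had been removed as unhealthy were recovered, which restored resources, and load was manually balanced away from the affected scale unit.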

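The measures to prevent recurrence are:

Improve detection and alerting for cases where automatic load balancing cannot keep up, so that engineers can step in manually and quickly.
Lower the maximum allowed utilization of smaller scale units to leave more resource headroom.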
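
To see why several modest events can add up, and why lowering the maximum allowed utilization buys headroom, here is a back-of-the-envelope sketch. Every number in it (100 nodes, 10 reserved, 10 unhealthy, a 20% demand surge) is invented for illustration and does not come from the RCA; `utilization` and `worst_case` are hypothetical helpers.

```python
def utilization(demand, total_nodes, reserved_nodes, unhealthy_nodes):
    """Return demand divided by usable capacity, assuming one capacity unit per node."""
    usable_nodes = total_nodes - reserved_nodes - unhealthy_nodes
    if usable_nodes <= 0:
        raise ValueError("no usable nodes left in the scale unit")
    return demand / usable_nodes


def worst_case(max_allowed_utilization, total_nodes=100, reserved_nodes=10,
               unhealthy_nodes=10, surge=1.2):
    """Utilization when reservation, node loss and a demand surge all coincide."""
    # Demand that exactly hits the allowed ceiling on a fully healthy unit.
    baseline_demand = max_allowed_utilization * total_nodes
    return utilization(baseline_demand * surge, total_nodes,
                       reserved_nodes, unhealthy_nodes)


if __name__ == "__main__":
    for ceiling in (0.8, 0.6):
        print(f"ceiling {ceiling:.0%} -> worst case {worst_case(ceiling):.0%}")
```

With these made-up numbers, a unit allowed to run at 80% ends up at roughly 120% of its usable capacity when everything goes wrong at once, while a 60% ceiling stays at about 90%.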

Impact

OBC, well known for its Bugyo series, also announced that the outage affecting Bugyo Cloud and related services was caused by Azure.

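I had the impression that companies used to avoid saying outright which cloud their services run on, but that seems less true lately.
The same was true of the series of security incidents that started from misconfigured Salesforce settings.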

This time, a client's web application (a B2B service) was running on A-series Azure VM instances, and the servers went down completely. I was kept busy dealing with it.

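The database (PostgreSQL) was not a managed service. Its data lived on a data disk mounted on the VM, so data loss (a rollback to an older state) was my biggest worry. Because the system is not that large, the database was not replicated to the Japan West region.
The VM apparently came back temporarily at times, but there were also periods when Apache returned 503 errors, possibly because the data disk was inaccessible.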

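At around 16:00 it was announced that the storage service had recovered, so I asked the client to reconfigure the VM from the Azure Portal, and the service resumed without trouble. Afterwards I was able to confirm all the data from that morning up to the point where access was lost, so it was a relief that the data loss I had feared did not happen.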

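There are two ways to check on Azure incidents: Azure status, which reports widespread outages, and Azure Service Health, which can be customized individually.
Azure Service Health provides personalized information and support for service issues in general, including outages and planned maintenance. I intend to set it up when I have time.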

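At a company as large as Microsoft, official announcements inevitably tend to be cautious, so notifications arrive late.
To some extent you have to act before the full picture is available, and the reactions on Twitter were a useful reference.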
