
An important process that runs quietly on Azure turns up!?

Posted at 2015-11-10

How it started

I was investigating because Azure Recovery Service kept failing on one particular VM.
For some reason, the job stopped with a snapshot error after roughly 50 minutes...

Nothing in the portal pointed to a cause no matter where I looked, so I contacted support. They asked for waagent.log, so I looked into what it actually is.

Contents of waagent.log

2015/11/04 19:15:51 ERROR:Retry=0
2015/11/04 19:15:51 ERROR:HTTP Req: POST /machine?comp=health
2015/11/04 19:15:51 ERROR:HTTP Req: Data=<?xml version="1.0" encoding="utf-8"?><Health xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><GoalStateIncarnation>3</GoalStateIncarnation><Container><ContainerId>xxx-xxx-xxx</ContainerId><RoleInstanceList><Role><InstanceId>xxxx-hostname</InstanceId><Health><State>Ready</State></Health></Role></RoleInstanceList></Container></Health>
2015/11/04 19:15:51 ERROR:HTTP Req: Header={'Content-Type': 'text/xml; charset=utf-8', 'x-ms-version': '2012-11-30', 'x-ms-agent-name': 'WALinuxAgent'}
2015/11/04 19:15:51 ERROR:HTTP Err: Status=503
2015/11/04 19:15:51 ERROR:HTTP Err: Reason=Service Unavailable
2015/11/04 19:15:51 ERROR:HTTP Err: Header=[('date', 'Wed, 04 Nov 2015 19:15:50 GMT'), ('connection', 'close'), ('content-type', 'text/html; charset=us-ascii'), ('content-length', '326'), ('server', 'Microsoft-HTTPAPI/2.0')]
2015/11/04 19:15:51 ERROR:HTTP Err: Body=<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
2015/11/04 19:15:51 ERROR:<HTML><HEAD><TITLE>Service Unavailable</TITLE>
2015/11/04 19:15:51 ERROR:<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
2015/11/04 19:15:51 ERROR:<BODY><h2>Service Unavailable</h2>
2015/11/04 19:15:51 ERROR:<hr><p>HTTP Error 503. The service is unavailable.</p>
2015/11/04 19:15:51 ERROR:</BODY></HTML>
2015/11/04 19:15:51 ERROR:

Not every entry was an ERROR, but in the end the agent had gone down on, of all things, a 503 Service Unavailable.
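To find entries like this without scrolling through the whole file, a small script can pull out every HTTP error status. This is a minimal sketch that assumes each line follows the `YYYY/MM/DD HH:MM:SS LEVEL:message` layout seen in the excerpt above; `find_http_errors` and `count_by_status` are hypothetical helper names, not part of waagent.

```python
import re
from collections import Counter

# Assumed line layout, based on the log excerpt:
#   YYYY/MM/DD HH:MM:SS LEVEL:message
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([A-Z]+):(.*)$")
STATUS_RE = re.compile(r"^HTTP Err: Status=(\d{3})$")


def find_http_errors(lines):
    """Return (timestamp, status_code) for every 'HTTP Err: Status=' entry."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines that do not follow the assumed layout
        timestamp, _level, message = m.groups()
        s = STATUS_RE.match(message.strip())
        if s:
            hits.append((timestamp, int(s.group(1))))
    return hits


def count_by_status(hits):
    """Count how often each HTTP status code appears."""
    return Counter(status for _timestamp, status in hits)
```

Passing an open file works because the function only iterates over lines, for example `find_http_errors(open("/var/log/waagent.log"))`.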

According to Azure status, there appears to have been a network outage around that time.

11/4
Network Infrastructure - East Asia - Advisory

SUMMARY OF IMPACT: Between 20:25 and 22:02 on 04 Nov 2015 UTC some customers deployed in East Asia may have experienced network availability drops for their Azure services. During this time hosted services in the region may have experienced a minimal degradation in service availability. PRELIMINARY ROOT CAUSE: This has preliminarily been attributed to an underlying network device issue. MITIGATION: The issue self-healed and engineers have validated impact has subsided. NEXT STEPS: Networking teams will investigate the root cause of the networking device failure.

Looking into the error

It had already been filed as an issue (yikes).

waagent dies when rate limited

Reportedly this is to be addressed in 2.1.2 (as of 2015/11/10)...
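Whether a given VM already carries that fix comes down to a version comparison. The sketch below compares versions as integer tuples rather than strings, so that something like 2.10.0 sorts after 2.1.2. The exact output format of `waagent --version` is an assumption here, so `parse_version` (a hypothetical helper) simply takes the first dotted number it finds in whatever text it is given.

```python
import re

FIXED_IN = (2, 1, 2)  # version the issue names as carrying the fix


def parse_version(text):
    """Extract the first dotted number such as '2.0.14' as a tuple of ints."""
    m = re.search(r"\d+(?:\.\d+)+", text)
    if m is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in m.group(0).split("."))


def has_fix(version_text, fixed_in=FIXED_IN):
    """True if the version found in version_text is at least fixed_in."""
    return parse_version(version_text) >= fixed_in
```

For example, `has_fix("2.0.14")` is false while `has_fix("2.1.2")` is true.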

What is waagent in the first place?

It appears to manage the interaction between the virtual machine and the Azure fabric controller. The part that looked relevant to this incident is the following.

VM extensions
* Inject components authored by Microsoft and partners into a Linux VM (IaaS) to automate software and configuration
* Reference implementations of VM extensions are at https://github.com/Azure/azure-linux-extensions

Looking at the repository, there is a module under VMBackup...

It seems that without it, capturing fails with an error.

What I checked

Snapshot-related processing (Recovery Service / capture) fails with an error
Confirm that the extension is listed
Confirm that waagent is running on the VM

[CentOS] ps aux|grep waagent
- (process) python /usr/sbin/waagent -daemon
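The same check can be scripted for monitoring. The following is a rough sketch, assuming the standard 11-column `ps aux` layout where the last column is the command line; `find_waagent` and `waagent_is_running` are hypothetical helper names.

```python
import subprocess


def find_waagent(ps_output):
    """Return the command lines in `ps aux` output that mention waagent."""
    matches = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the 11th (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        # Ignore the grep process itself if the output came from a pipeline.
        if "waagent" in command and "grep" not in command:
            matches.append(command)
    return matches


def waagent_is_running():
    """Run `ps aux` and report whether a waagent process is present."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return bool(find_waagent(result.stdout))
```

Keeping the parsing in `find_waagent` separate from the `subprocess` call makes it possible to try the logic against saved `ps` output.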

Check the contents of /var/log/waagent.log
Check the version with waagent --version

As of 2015/11/10, 2.0.14 is the latest
If an update is needed

Summary

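In the end it took me a whole week to figure this out. The agent seems to run for management purposes beyond Recovery Service as well, so it looks like it needs monitoring and regular updates.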
