
Hello, this is Ishizeki (@kccs_daisuke-ishizeki) from Kyocera Communication Systems.

This article introduces Personalized Service Health (PSH), which was released in Preview on 2023/8/3.
It took me longer than planned to write it up.

PSH became generally available (GA) in January 2024.

Google Cloud Service Health

How do you keep track of Google Cloud outages?

The official status page? Its RSS feed?
Downdetector?
X (Twitter)?

As the official source, Google Cloud publishes status and incident information for each product on the Google Cloud Service Health dashboard.
GoogleCloudServiceHealth.png

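However, smaller incidents are sometimes not listed there, and in some cases we only learned about them by contacting support.
In addition, because the dashboard shows the status of every product in every location, I found it hard to tell whether an incident that actually affects me was in progress.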

Personalized Service Health (PSH) is a new product aimed at exactly this problem!

What is Personalized Service Health (PSH)?

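PSH provides information about incidents related to the products used in your project (or organization), including their status and location.
It also appears to cover incidents that are not publicly disclosed, in addition to the incidents shown on Google Cloud Service Health.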

PSH also writes event information, called Service Health events, to Cloud Logging.
(So it is easy to turn them into alerts and route them to email, SMS, Slack, and so on!)

There are three main ways to use it:

  • Dashboard
  • Alerts (Cloud Monitoring, Cloud Logging)
  • API

The overall architecture is shown in the diagram below.
Image quoted from the official documentation.
diagram.png

Trying it out

That was a long preamble, so let's see what PSH actually looks like in practice.

Once you enable the API, the PSH dashboard becomes available.
Besides the dashboard, you can also send alert notifications and retrieve service health events through the API.

Enabling the API

To use PSH, first enable the Service Health API.
PSH_1.png

PSH dashboard

Once the Service Health API is enabled, the dashboard becomes available.
PSH_2.png
When there is an incident related to a product used in the project, it appears to be displayed like this.
PSH_3.png
Incident detail screen
PSH_4.png
Incident history
PSH_5.png

Instead of searching through 158 products (as of 2023/9/4), you see only the incidents relevant to your project, which is a real plus!

PSH alerts

You can create an alert policy from CREATE ALERT POLICY at the top right of the dashboard.

PSH_6.png
The alert policy created here is tied to Cloud Monitoring alerting.
Or rather, the PSH screen appears to create a Cloud Monitoring alert policy directly.
PSH_7.png
Details of the alert policy in question
A log-based alert policy has been configured.
PSH_8.png
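
To give a rough idea of what a log-based alert policy like this matches on, here is a minimal Python sketch that builds a Cloud Logging filter string. The resource type `servicehealth.googleapis.com/Event` and the `jsonPayload.category` / `jsonPayload.state` field names are assumptions modeled on the fields the Service Health API returns, not values read from the screenshots, and `build_psh_log_filter` is a hypothetical helper. Check the filter that the console actually generates before relying on it.

```python
def build_psh_log_filter(category="INCIDENT", state="ACTIVE"):
    """Build a Cloud Logging filter for Service Health events.

    The resource type and jsonPayload field names are assumptions;
    compare them with the filter shown in your own alert policy.
    """
    clauses = ['resource.type="servicehealth.googleapis.com/Event"']
    # Each optional clause narrows the match; pass None to skip it.
    if category:
        clauses.append(f'jsonPayload.category="{category}"')
    if state:
        clauses.append(f'jsonPayload.state="{state}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_psh_log_filter())
```

Passing `category=None` or `state=None` drops that clause, which may be handy while first checking which log entries exist at all.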

API access

As a quick test, I called the API directly.

curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)" "https://servicehealth.googleapis.com/v1beta/projects/{PROJECT_ID}/locations/global/events/n2h_aagLTVSDceNrl0Oy3Q"
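
For reference, the same request could be assembled in Python with only the standard library, roughly as sketched below. The helper names `event_url` and `build_request` are hypothetical, the `v1beta` base URL simply mirrors the curl command above, and the access token is assumed to come from `gcloud auth print-access-token`.

```python
import urllib.request

# Base URL copied from the curl command above (v1beta at the time of writing).
BASE_URL = "https://servicehealth.googleapis.com/v1beta"


def event_url(project_id, event_id, location="global"):
    """Return the REST URL of a single Service Health event."""
    return (
        f"{BASE_URL}/projects/{project_id}"
        f"/locations/{location}/events/{event_id}"
    )


def build_request(project_id, event_id, access_token):
    """Build (but do not send) an authenticated GET request."""
    return urllib.request.Request(
        event_url(project_id, event_id),
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


if __name__ == "__main__":
    # Placeholder values; substitute a real project ID, event ID, and token.
    req = build_request("my-project", "EVENT_ID", "ACCESS_TOKEN")
    print(req.get_method(), req.full_url)
    # To actually send it: json.load(urllib.request.urlopen(req))
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before any network call is made.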

The curl call returned the following response.
Just as documented!

{
  "name": "projects/{PROJECT_ID}/locations/global/events/n2h_aagLTVSDceNrl0Oy3Q",
  "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
  "description": "The issue with Cloud Data Fusion, Dataproc Metastore, Google Cloud Console, Google Cloud Networking, Google Compute Engine has been resolved for all affected projects as of Thursday, 2023-08-03 17:27 US/Pacific.\n\nWe thank you for your patience while we worked on resolving the issue.",
  "category": "INCIDENT",
  "state": "CLOSED",
  "relevance": "UNKNOWN",
  "updates": [
    {
      "updateTime": "2023-08-03T22:52:38.704621Z",
      "title": "Increased error rates for GCE APIs",
      "description": "We are experiencing an issue with Google Compute Engine API.\n\nOur engineering team continues to investigate the issue.\n\nWe will provide an update by Thursday, 2023-08-03 17:00 US/Pacific with current details.\n\nWe apologize to all who are affected by the disruption.",
      "symptom": "Customers may experience increased rates of internal errors when calling global Compute Engine APIs.",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-03T23:16:11.024011Z",
      "title": "Increased error rates for Google Cloud APIs in us-central1",
      "description": "We are experiencing an issue with Google Cloud Networking, Google Compute Engine.\n\nOur engineering team continues to investigate the issue.\n\nWe will provide an update by Thursday, 2023-08-03 17:15 US/Pacific with current details.\n\nWe apologize to all who are affected by the disruption.",
      "symptom": "Customers may experience increased rates of internal errors when calling Google Cloud APIs in us-central1, including when attempting to make changes to load balancers in the affected region.",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-03T23:47:10.396836Z",
      "title": "Increased error rates for Google Cloud APIs in us-central1",
      "description": "We are experiencing an issue with Cloud Data Fusion, Dataproc Metastore, Google Cloud Networking, Google Compute Engine.\n\nOur engineering team continues to investigate the issue.\n\nWe will provide an update by Thursday, 2023-08-03 20:41 US/Pacific with current details.\n\nWe apologize to all who are affected by the disruption.",
      "symptom": "Customers may experience increased rates of internal errors when calling Google Cloud APIs in us-central1, including when attempting to make changes to load balancers in the affected region.",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-03T23:55:24.329645Z",
      "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
      "description": "We are experiencing an issue with Cloud Data Fusion, Dataproc Metastore, Google Cloud Networking, Google Compute Engine.\n\nOur engineering team continues to investigate the issue.\n\nWe will provide an update by Thursday, 2023-08-03 18:00 US/Pacific with current details.\n\nWe apologize to all who are affected by the disruption.",
      "symptom": "\n- Cloud Load Balancer customers may receive 502 errors when creating/updating load balancers in us-central1.\n- Increased error rates for GCE instance creation in us-central1\n-Dataproc Metastore creations may be failing in us-central1, us-south1, and us-east1\n-Increased error rates for GKE cluster operations",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-04T00:01:40.548829Z",
      "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
      "description": "We are experiencing an issue with Cloud Data Fusion, Dataproc Metastore, Google Cloud Networking, Google Compute Engine.\n\nOur engineering team continues to investigate the issue.\n\nWe will provide an update by Thursday, 2023-08-03 18:00 US/Pacific with current details.\n\nWe apologize to all who are affected by the disruption.",
      "symptom": "\n- Cloud Load Balancer customers may receive 502 errors when creating/updating load balancers in us-central1.\n- Increased error rates for GCE instance creation in us-central1\n-Dataproc Metastore creations may be failing in us-central1, us-south1, and us-east1\n-Increased error rates for GKE cluster operations\n-Customers may receive errors or may be unable to load pages in Cloud Console",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-04T00:06:23.244036Z",
      "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
      "description": "Mitigation work is currently underway by our engineering team and error rates appear to be decreasing.\n\nWe do not have an ETA for full mitigation at this point.\n\nWe will provide more information by Friday, 2023-08-04 02:41 US/Pacific.",
      "symptom": "\n- Cloud Load Balancer customers may receive 502 errors when creating/updating load balancers in us-central1.\n- Increased error rates for GCE instance creation in us-central1\n-Dataproc Metastore creations may be failing in us-central1, us-south1, and us-east1\n-Increased error rates for GKE cluster operations\n-Customers may receive errors or may be unable to load pages in Cloud Console",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-04T00:07:21.715258Z",
      "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
      "description": "Mitigation work is currently underway by our engineering team and error rates appear to be decreasing.\n\nWe do not have an ETA for full mitigation at this point.\n\nWe will provide more information by Thursday, 2023-08-03 18:00 US/Pacific.",
      "symptom": "\n- Cloud Load Balancer customers may receive 502 errors when creating/updating load balancers in us-central1.\n- Increased error rates for GCE instance creation in us-central1\n-Dataproc Metastore creations may be failing in us-central1, us-south1, and us-east1\n-Increased error rates for GKE cluster operations\n-Customers may receive errors or may be unable to load pages in Cloud Console",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-04T00:18:43.598784Z",
      "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
      "description": "At this time error rates appear to have returned to normal levels. Engineers are continuing to monitor to confirm full recovery.\n\nWe do not have an ETA for full mitigation at this point.\n\nWe will provide more information by Thursday, 2023-08-03 18:00 US/Pacific.",
      "symptom": "\n- Cloud Load Balancer customers may receive 502 errors when creating/updating load balancers in us-central1.\n- Increased error rates for Cloud Data Fusion instance creation in us-central1\n-Dataproc Metastore creations may be failing in us-central1, us-south1, and us-east1\n-Increased error rates for GKE cluster operations\n-Customers may receive errors or may be unable to load pages in Cloud Console",
      "workaround": "None at this time."
    },
    {
      "updateTime": "2023-08-04T00:28:57.893171Z",
      "title": "Multiple GCP products affected by Increased error rates for Google Cloud APIs in us-central1",
      "description": "The issue with Cloud Data Fusion, Dataproc Metastore, Google Cloud Console, Google Cloud Networking, Google Compute Engine has been resolved for all affected projects as of Thursday, 2023-08-03 17:27 US/Pacific.\n\nWe thank you for your patience while we worked on resolving the issue.",
      "symptom": "\n- Cloud Load Balancer customers may receive 502 errors when creating/updating load balancers in us-central1.\n- Increased error rates for Cloud Data Fusion instance creation in us-central1\n-Dataproc Metastore creations may be failing in us-central1, us-south1, and us-east1\n-Increased error rates for GKE cluster operations\n-Customers may receive errors or may be unable to load pages in Cloud Console",
      "workaround": "None at this time."
    }
  ],
  "updateTime": "2023-08-09T21:55:09.876347Z",
  "startTime": "2023-08-03T22:52:38.704621Z",
  "endTime": "2023-08-04T00:28:57.893171Z",
  "detailedState": "RESOLVED",
  "eventImpacts": [
    {
      "product": {
        "productName": "Cloud Data Fusion"
      },
      "location": {
        "locationName": "us-central1"
      }
    },
    {
      "product": {
        "productName": "Dataproc Metastore"
      },
      "location": {
        "locationName": "us-central1"
      }
    },
    {
      "product": {
        "productName": "Google Cloud Console"
      }
    },
    {
      "product": {
        "productName": "Google Compute Engine"
      },
      "location": {
        "locationName": "us-central1"
      }
    }
  ]
}
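
The response is fairly long, so here is a minimal sketch of how it might be condensed for a notification or report. `summarize_event` and `parse_ts` are hypothetical helpers, and `SAMPLE_EVENT` is a trimmed copy of the response above (the `updates` array and some impacts are omitted for brevity). The sketch assumes the field names shown in the response and that the `location` key can be absent, as it is for Google Cloud Console.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an RFC 3339 timestamp such as 2023-08-03T22:52:38.704621Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize_event(event):
    """Condense a Service Health event dict into a few key fields."""
    impacts = event.get("eventImpacts", [])
    products = sorted(
        {i["product"]["productName"] for i in impacts if "product" in i}
    )
    locations = sorted(
        {i["location"]["locationName"] for i in impacts if "location" in i}
    )
    summary = {
        "title": event.get("title"),
        "state": event.get("state"),
        "products": products,
        "locations": locations,
        "update_count": len(event.get("updates", [])),
        "duration_minutes": None,
    }
    # Duration is only known once the event has both timestamps.
    if "startTime" in event and "endTime" in event:
        delta = parse_ts(event["endTime"]) - parse_ts(event["startTime"])
        summary["duration_minutes"] = round(delta.total_seconds() / 60)
    return summary


# Trimmed copy of the response above, for illustration only.
SAMPLE_EVENT = {
    "title": "Multiple GCP products affected by Increased error rates "
             "for Google Cloud APIs in us-central1",
    "state": "CLOSED",
    "startTime": "2023-08-03T22:52:38.704621Z",
    "endTime": "2023-08-04T00:28:57.893171Z",
    "eventImpacts": [
        {"product": {"productName": "Cloud Data Fusion"},
         "location": {"locationName": "us-central1"}},
        {"product": {"productName": "Google Cloud Console"}},
    ],
}

if __name__ == "__main__":
    print(summarize_event(SAMPLE_EVENT))
```

Using sets for products and locations removes the duplicates that arise when several products are affected in the same region.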

Summary

Because PSH also includes the incidents published on Google Cloud Service Health, I think its biggest advantage is that the PSH dashboard shows, at a glance, the incidents affecting the products you actually use.

The Cloud Logging integration is another plus!

The API itself is free to use, so it may be worth simply enabling it for now.

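The Cloud Logging side of PSH is stored in the _Default log bucket.
If you have not changed the retention period from the default of 30 days, no retention charges are incurred. Judging from the log contents and how often they are written, ingestion charges should also be close to zero.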
