Trying a Japanese translation of Monitoring 101 (series #3)


Monitoring 101: Investigating performance issues

This post is part of a series on effective monitoring. Be sure to check out the rest of the series: Collecting the right data and Alerting on what matters.

The responsibilities of a monitoring system do not end with symptom detection. Once your monitoring system has notified you of a real symptom that requires attention, its next job is to help you diagnose the root cause. Often this is the least structured aspect of monitoring, driven largely by hunches and guess-and-check. This post describes a more directed approach that can help you to find and correct root causes more efficiently.


This series of articles comes out of our experience monitoring large-scale infrastructure for our customers. It also draws on the work of Brendan Gregg, Rob Ewaschuk, and Baron Schwartz.

This series draws on Datadog's experience monitoring large-scale infrastructure for its customers, and on the blog posts of Brendan Gregg, Rob Ewaschuk, and Baron Schwartz.

A word about data

[Figure: metric types]

There are three main types of monitoring data that can help you investigate the root causes of problems in your infrastructure. Data types and best practices for their collection are discussed in depth in a companion post, but in short:

  • Work metrics indicate the top-level health of your system by measuring its useful output
  • Resource metrics quantify the utilization, saturation, errors, or availability of a resource that your system depends on
  • Events describe discrete, infrequent occurrences in your system such as code changes, internal alerts, and scaling events



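  • Work metrics represent the performance of the services running on the system you operate.
  • Resource metrics cover the utilization, saturation level, errors, or availability of the resources that the system you operate depends on.
  • Events are discrete, irregular occurrences within the system, such as code changes, internal alerts, and changes in the number of hosts.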

By and large, work metrics will surface the most serious symptoms and should therefore generate the most serious alerts. But the other metric types are invaluable for investigating the causes of those symptoms.

In most cases, work metrics surface the signs of serious problems. Work metrics should therefore trigger the alerts for the most serious failures.

It’s resources all the way down

[Figure: metric uses]

Most of the components of your infrastructure can be thought of as resources. At the highest levels, each of your systems that produces useful work likely relies on other systems. For instance, the Apache server in a LAMP stack relies on a MySQL database as a resource to support its work. One level down, within MySQL are database-specific resources that MySQL uses to do its work, such as the finite pool of client connections. At a lower level still are the physical resources of the server running MySQL, such as CPU, memory, and disks.
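The layering described above can be sketched as a small tree, where each component lists the resources it relies on. This is only an illustration: the `Component` class, the `walk` helper, and the exact nesting of the LAMP example are assumptions made for this sketch, not part of the original article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A system or resource, plus the resources it relies on (hypothetical model)."""
    name: str
    resources: List["Component"] = field(default_factory=list)


def walk(component, depth=0):
    """Yield (depth, name) pairs, starting with the top-level system."""
    yield depth, component.name
    for resource in component.resources:
        yield from walk(resource, depth + 1)


# The LAMP example from the text; the nesting shown here is an assumption.
server = Component("server", [Component("CPU"), Component("memory"), Component("disks")])
mysql = Component("MySQL", [Component("client connection pool"), server])
apache = Component("Apache", [mysql])

for depth, name in walk(apache):
    print("  " * depth + name)
```

Printing the tree with indentation makes it easy to see which resources sit beneath a system that is showing symptoms.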


Thinking about which systems produce useful work, and which resources support that work, can help you to efficiently get to the root of any issues that surface. When an alert notifies you of a possible problem, the following process will help you to approach your investigation systematically.


[Figure: recursive investigation]

1. Start at the top with work metrics

First ask yourself, “Is there a problem? How can I characterize it?” If you don’t describe the issue clearly at the outset, it’s easy to lose track as you dive deeper into your systems to diagnose the issue.


Next examine the work metrics for the highest-level system that is exhibiting problems. These metrics will often point to the source of the problem, or at least set the direction for your investigation. For example, if the percentage of work that is successfully processed drops below a set threshold, diving into error metrics, and especially the types of errors being returned, will often help narrow the focus of your investigation. Alternatively, if latency is high, and the throughput of work being requested by outside systems is also very high, perhaps the system is simply overburdened.
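As a rough sketch of this first pass, the two examples above (a falling success rate, and high latency combined with high throughput) can be written as simple checks. The `WorkMetrics` fields, the `triage` function, and every threshold value are hypothetical placeholders; real thresholds depend on your own system.

```python
from dataclasses import dataclass


@dataclass
class WorkMetrics:
    success_rate: float     # fraction of work processed successfully, 0.0 to 1.0
    latency_ms: float       # e.g. a high percentile of response time
    throughput_rps: float   # work requested per second


def triage(m, min_success=0.99, max_latency_ms=500.0, max_throughput_rps=1000.0):
    """Return hints on where to look next. All thresholds are placeholder values."""
    hints = []
    if m.success_rate < min_success:
        hints.append("success rate below threshold: break errors down by type")
    if m.latency_ms > max_latency_ms:
        if m.throughput_rps > max_throughput_rps:
            hints.append("latency and throughput both high: system may be overburdened")
        else:
            hints.append("latency high at normal throughput: examine resources")
    return hints or ["work metrics look normal: examine resources next"]


print(triage(WorkMetrics(success_rate=0.95, latency_ms=120.0, throughput_rps=300.0)))
```

The hints only set a direction for the investigation; they do not replace looking at the error types or the resources themselves.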


2. Dig into resources

If you haven’t found the cause of the problem by inspecting top-level work metrics, next examine the resources that the system uses—physical resources as well as software or external services that serve as resources to the system. If you’ve already set up dashboards for each system as outlined below, you should be able to quickly find and peruse metrics for the relevant resources. Are those resources unavailable? Are they highly utilized or saturated? If so, recurse into those resources and begin investigating each of them at step 1.
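The recursion in this step might look roughly like the following. The `Resource` fields, the 0.9 utilization cutoff, and the sample values are all assumptions chosen for illustration; in practice the judgment comes from reading each resource's dashboard rather than from a single boolean test.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    name: str
    available: bool = True
    utilization: float = 0.0   # fraction of capacity in use, 0.0 to 1.0
    saturated: bool = False    # work is queuing for this resource
    resources: List["Resource"] = field(default_factory=list)


def is_suspect(r, max_utilization=0.9):
    """Unavailable, saturated, or highly utilized (the cutoff is a placeholder)."""
    return (not r.available) or r.saturated or r.utilization > max_utilization


def investigate(system, path=()):
    """Return paths from the top-level system down to each deepest suspect resource."""
    path = path + (system.name,)
    findings = []
    for r in system.resources:
        if is_suspect(r):
            # Recurse: treat the suspect resource as the system under investigation.
            findings.extend(investigate(r, path))
    return findings or [path]


# Hypothetical readings for the LAMP example.
disks = Resource("disks", saturated=True)
server = Resource("server", utilization=0.95, resources=[Resource("CPU", utilization=0.4), disks])
mysql = Resource("MySQL", saturated=True, resources=[Resource("client connection pool", utilization=0.5), server])
apache = Resource("Apache", resources=[mysql])

print(investigate(apache))
```

Each returned path ends at the deepest resource that still looks problematic, which is where the investigation would restart at step 1.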


3. Did something change?

Next consider alerts and other events that may be correlated with your metrics. If a code release, internal alert, or other event was registered slightly before problems started occurring, investigate whether it may be connected to the problem.
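One simple way to shortlist candidate events is to filter for those registered shortly before the problem began. The sketch below assumes events are available as `(timestamp, label)` pairs; the `events_before` helper, the 30-minute window, and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def events_before(events, problem_start, window=timedelta(minutes=30)):
    """Return events registered within `window` before the problem, most recent first."""
    recent = [(ts, label) for ts, label in events
              if problem_start - window <= ts <= problem_start]
    return sorted(recent, reverse=True)


# Made-up sample data for illustration only.
events = [
    (datetime(2024, 1, 1, 9, 0), "scaling event"),
    (datetime(2024, 1, 1, 11, 50), "code release"),
    (datetime(2024, 1, 1, 12, 30), "internal alert"),
]
problem_start = datetime(2024, 1, 1, 12, 0)

for ts, label in events_before(events, problem_start):
    print(ts.isoformat(), label)
```

An event appearing in this list is only a lead: timing alone does not show that it caused the problem.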


4. Fix it (and don’t forget it)

Once you have determined what caused the issue, correct it. Your investigation is complete when symptoms disappear—you can now think about how to change the system to avoid similar problems in the future.


Build dashboards before you need them


In an outage, every minute is crucial. To speed your investigation and keep your focus on the task at hand, set up dashboards in advance. You may want to set up one dashboard for your high-level application metrics, and one dashboard for each subsystem. Each system’s dashboard should render the work metrics of that system, along with resource metrics of the system itself and key metrics of the subsystems it depends on. If event data is available, overlay relevant events on the graphs for correlation analysis.
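A dashboard laid out this way could be described with a plain data structure before it is built in whatever tool you use. The sketch below is not any vendor's API: the `build_dashboard` function, the dictionary keys, and the metric names are all hypothetical.

```python
def build_dashboard(system, work_metrics, resource_metrics, dependencies, key_metrics):
    """Describe one system's dashboard as a plain dict (not any vendor's API)."""
    return {
        "title": f"{system} overview",
        # Work metrics of the system itself.
        "work": list(work_metrics),
        # Resource metrics of the system itself.
        "resources": list(resource_metrics),
        # Key metrics of each subsystem this system depends on.
        "dependencies": {dep: list(key_metrics.get(dep, [])) for dep in dependencies},
        # Overlay events on the graphs when event data is available.
        "overlay_events": True,
    }


# Hypothetical metric names for the Apache/MySQL example.
key_metrics = {"MySQL": ["queries_per_second", "connections_in_use"]}
dashboard = build_dashboard(
    system="Apache",
    work_metrics=["requests_per_second", "error_rate", "latency_ms"],
    resource_metrics=["cpu_utilization", "memory_utilization"],
    dependencies=["MySQL"],
    key_metrics=key_metrics,
)
print(dashboard["title"], dashboard["dependencies"])
```

Keeping the same four parts on every dashboard means that during an outage each system's page can be read the same way.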


Conclusion: Follow the metrics

Adhering to a standardized monitoring framework allows you to investigate problems more systematically:

  • For each system in your infrastructure, set up a dashboard ahead of time that displays all its key metrics, with relevant events overlaid.
  • Investigate causes of problems by starting with the highest-level system that is showing symptoms, reviewing its work and resource metrics and any associated events.
  • If problematic resources are detected, apply the same investigation pattern to the resource (and its constituent resources) until your root problem is discovered and corrected.


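  • For each system in your infrastructure, prepare a dashboard in advance that overlays the system's key metrics with related events.
  • Begin investigating the cause of a problem from the symptoms in the system layer closest to the user, reviewing its work metrics and resource metrics and checking any events associated with them.
  • If a resource turns out to have a problem, apply the same investigation pattern to that resource. Apply the pattern to its sub-resources in turn, and continue until the root cause has been found and fixed.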

We would like to hear about your experiences as you apply this framework to your own monitoring practice. If it is working well, please let us know on Twitter! Questions, corrections, additions, complaints, etc? Please let us know on GitHub.

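We would love to hear about your experience applying what you learned from this framework to your existing monitoring practice. If adopting the framework improved your monitoring, please tweet about it and mention @datadoghq. If you have questions, corrections, additions, complaints, or anything else, please let us know through a GitHub issue.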
