For performance analysis, check CPU usage with top

Last updated at 2025-12-14 Posted at 2025-12-14

This article is day 15 of the NTT DOCOMO SOLUTIONS Advent Calendar 2025.

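In my day job I handle RHEL support inquiries and do technical research on the kernel. About once a week my team gets a request to analyze resource data, so this time I'm writing about resource data again.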

One day

While I was working on a performance analysis, someone gave me some advice:

"You're using ps to investigate the CPU spike between 04:00 and 04:10, but you'd be better off using top."

What does that mean?

ps is not suited to analyzing performance during a specific time window.

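That's because the two commands calculate CPU usage differently:

ps: average CPU usage over the process's lifetime (since the process started)
top: average CPU usage over top's refresh interval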
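Written out as formulas, the difference looks roughly like this. This is a simplified sketch based on the man page descriptions that follow and on the /proc/<pid>/stat fields documented in proc(5): utime, stime and starttime are in clock ticks, HZ is the tick rate (`getconf CLK_TCK`), uptime is the first field of /proc/uptime, and Δ is the change between two samples taken Δt seconds apart. The top formula corresponds to top's default Irix mode; with Irix mode off, top additionally divides by the number of CPUs.

```latex
\begin{aligned}
\%\mathrm{CPU}_{\mathrm{ps}}  &= 100 \times \frac{(\mathit{utime} + \mathit{stime}) / \mathit{HZ}}{\mathit{uptime} - \mathit{starttime} / \mathit{HZ}} \\
\%\mathrm{CPU}_{\mathrm{top}} &= 100 \times \frac{\Delta(\mathit{utime} + \mathit{stime}) / \mathit{HZ}}{\Delta t}
\end{aligned}
```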

Here is what the man pages say:

   %cpu        %CPU      cpu utilization of the process in "##.#" format.  Currently, it is the CPU time used divided by the time the process has been running
                         (cputime/realtime ratio), expressed as a percentage.  It will not add up to 100% unless you are lucky.  (alias pcpu).

Source: man ps(1)
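To make the "cputime/realtime ratio" concrete, here is a minimal bash sketch that recomputes a ps-style value straight from /proc. The function name ps_style_pcpu is made up for illustration. It assumes Linux, where utime, stime and starttime are fields 14, 15 and 22 of /proc/<pid>/stat (see proc(5)), falls back to a tick rate of 100 if getconf is unavailable, and ignores details such as the CPU time of reaped children that ps can optionally include.

```bash
#!/bin/bash
# ps_style_pcpu PID: lifetime-average %CPU, i.e. CPU time divided by the time
# since the process started (the "cputime/realtime ratio" from man ps). Sketch only.
ps_style_pcpu() {
    local pid=$1 line rest hz up
    local -a f
    line=$(< "/proc/$pid/stat") || return 1
    # Field 2 (comm) may contain spaces, so drop everything up to the last ") ".
    # After that, f[0] is field 3, so proc(5) field N is f[N-3].
    rest=${line##*") "}
    read -r -a f <<< "$rest"
    hz=$(getconf CLK_TCK 2>/dev/null || echo 100)   # clock ticks per second (assume 100 as fallback)
    read -r up _ < /proc/uptime                      # seconds since boot
    # f[11]=utime, f[12]=stime, f[19]=starttime (all in clock ticks)
    awk -v ut="${f[11]}" -v st="${f[12]}" -v start="${f[19]}" \
        -v hz="$hz" -v up="$up" 'BEGIN {
        elapsed = up - start / hz      # seconds the process has been running
        if (elapsed <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (ut + st) / hz / elapsed
    }'
}
```

While busy.sh is running, `ps_style_pcpu "$(pidof yes)"` should print a value close to `ps -p "$(pidof yes)" -o %cpu=`, give or take rounding.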

    1. %CPU  --  CPU Usage
      The task's share of the elapsed CPU time since the last screen update, expressed as a percentage of total CPU time.

      In a true SMP environment, if a process is multi-threaded and top is not operating in Threads mode, amounts greater than 100% may be reported.   You  toggle
      Threads mode with the `H' interactive command.

Source: man top(1)
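The top side can be sketched the same way: sample the process's CPU ticks twice and divide the difference by the wall-clock time in between. The helpers cpu_ticks and top_style_pcpu are hypothetical names, the interval is measured with /proc/uptime (10 ms resolution), and the result follows top's default Irix mode, so a multi-threaded process can exceed 100%.

```bash
#!/bin/bash
# cpu_ticks PID: utime + stime (proc(5) fields 14 and 15) in clock ticks
cpu_ticks() {
    local line rest
    local -a f
    line=$(< "/proc/$1/stat") || return 1
    rest=${line##*") "}            # skip "pid (comm) "; f[0] is then field 3
    read -r -a f <<< "$rest"
    echo $(( f[11] + f[12] ))
}

# top_style_pcpu PID [SECONDS]: %CPU averaged over one sampling interval only
top_style_pcpu() {
    local pid=$1 interval=${2:-1} hz t0 t1 c0 c1
    hz=$(getconf CLK_TCK 2>/dev/null || echo 100)   # assume 100 if getconf is missing
    c0=$(cpu_ticks "$pid") || return 1
    read -r t0 _ < /proc/uptime
    sleep "$interval"
    c1=$(cpu_ticks "$pid") || return 1
    read -r t1 _ < /proc/uptime
    awk -v c0="$c0" -v c1="$c1" -v t0="$t0" -v t1="$t1" -v hz="$hz" 'BEGIN {
        dt = t1 - t0               # actual elapsed seconds between the two samples
        if (dt <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (c1 - c0) / hz / dt
    }'
}
```

Run against the yes process from busy.sh, `top_style_pcpu "$(pidof yes)" 1` should print something near 100.0 during a running phase and near 0.0 during a stopped phase.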

So what's the problem with ps?

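ps reports an average over the process's entire lifetime. As a result, for a process that is busy during one particular period but fairly idle the rest of the time, it can give a misleading picture of CPU usage. Let's check how this plays out in practice.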

First, prepare a script that lets a process run for 10 seconds and then stops it for 10 seconds, repeating this three times.

busy.sh

#!/bin/bash

function _loop() {
    kill -SIGCONT "$pid"
    echo "running"
    sleep 10

    kill -SIGSTOP "$pid"
    echo "stopping"
    sleep 10
}

function _cleanup() {
    kill "$pid"
}

yes > /dev/null &
pid=$!
trap _cleanup EXIT
seq 1 3 | while read -r i
do
    _loop "$i"
done

Measure it with both ps and top.

measure.sh

#!/bin/bash

pid="$(pidof yes)"

if [ "$1" == "ps" ];then
    seq 1 60 | while read -r i
    do
        ps -p "$pid" -o %cpu=
        sleep 1
    done
elif [ "$1" == "top" ];then
    # Match the PID column ($1) instead of grepping, so header lines can't match; $9 is %CPU
    top -d 1 -n 60 -p "$pid" -b | awk -v pid="$pid" '$1 == pid { print $9; fflush() }'
fi

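Here are the results plotted as a graph. Both were sampled once per second, but while top captures how the process's workload changes, ps fails to show the periods when the process is stopped (because ps's CPU usage is an average over the process's lifetime).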
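To make the shape of that result concrete without a real machine, here is a small simulation of what the two methods would report for an idealized version of busy.sh. It assumes the yes process uses exactly 100% of one CPU while running, 0% while stopped, and that sampling starts at the moment the process starts. These are simplifying assumptions, not measured data, and simulate_busy is a hypothetical helper.

```bash
#!/bin/bash
# simulate_busy: print "t ps_style top_style" for t = 1..60 under an idealized
# busy.sh timeline (run 10 s, stop 10 s, three times; 100% of one CPU while running).
simulate_busy() {
    awk 'BEGIN {
        cum = 0                                       # accumulated CPU seconds
        for (t = 1; t <= 60; t++) {
            running = (int((t - 1) / 10) % 2 == 0)    # seconds 1-10, 21-30, 41-50 run
            if (running) cum += 1
            ps  = 100 * cum / t                       # lifetime average (ps-style)
            top = 100 * running                       # last-second average (top-style)
            printf "%d %.1f %.1f\n", t, ps, top
        }
    }'
}
```

Under these assumptions, at t=20 the top-style value is already 0.0 while the ps-style value still reads 50.0, and at t=60 ps reports 50.0 even though the process spent its last 10 seconds stopped.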

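Be careful: with ps it's easy to misread CPU usage, especially for processes that have been running for a long time. Of course, top has the same problem if its sampling interval is long, but its default interval is 3 seconds, so you're unlikely to get caught out. Either way, the key point is to stay aware of which time period an average covers. That's the takeaway.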
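To illustrate how the window length matters, here is a hypothetical helper, window_avg, that averages an idealized version of the busy.sh pattern (100% of one CPU for 10 s, then 0% for 10 s, three times) over consecutive W-second windows. It is a simulation under those assumptions, not measured data.

```bash
#!/bin/bash
# window_avg W: average %CPU over consecutive W-second windows for an idealized
# busy.sh timeline (run 10 s, stop 10 s, three times). Illustration only.
window_avg() {
    local w=$1
    [ "$w" -gt 0 ] 2>/dev/null || return 1     # W must be a positive integer
    awk -v w="$w" 'BEGIN {
        for (start = 0; start + w <= 60; start += w) {
            busy = 0
            for (t = start + 1; t <= start + w; t++)
                if (int((t - 1) / 10) % 2 == 0) busy++   # second t is a "running" second
            printf "%d-%d %.1f\n", start, start + w, 100 * busy / w
        }
    }'
}
```

With W=3 (top's default interval), the run/stop edges are only slightly blurred (for example, the 9-12 window reads 33.3). With W=20, every window covers exactly one run phase and one stop phase, so the output is a flat 50.0 that hides the pattern completely.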

Addendum

"So top averages over the interval you give it? Then what happens if I keep launching a top that runs only once?" If that's what you're thinking, this one's for you (well, for me).
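One way to set up that experiment is to launch a fresh one-shot top once per second against the yes process from busy.sh. This is a sketch, not necessarily how the measurement here was taken; measure_top1 is a hypothetical name, and it assumes top's default batch layout where column 9 is %CPU.

```bash
#!/bin/bash
# measure_top1 PID [COUNT]: start a fresh one-shot top once per second and
# print the %CPU it reports for PID (column 9 of top's default batch layout).
measure_top1() {
    local pid=$1 count=${2:-60} i
    for (( i = 0; i < count; i++ )); do
        # Each top -n 1 run reports %CPU over its own short startup interval
        top -b -n 1 -p "$pid" | awk -v pid="$pid" '$1 == pid { print $9 }'
        sleep 1
    done
}
```

Usage while busy.sh is running: `measure_top1 "$(pidof yes)"`.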

Let's try top -b -n 1. Here are the results:

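It tracks the changes closely. Apparently, on its first frame top reads the process statistics twice in quick succession (about 0.15 s apart in the strace output below) and averages over that short interval.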

$ strace -ttf -e openat top -b -n 1 -p $(pidof yes)  2>&1 | grep "/proc/$(pidof yes)/stat"
15:47:29.947750 openat(AT_FDCWD, "/proc/777013/stat", O_RDONLY) = 6
15:47:29.947869 openat(AT_FDCWD, "/proc/777013/statm", O_RDONLY) = 6
15:47:30.101337 openat(AT_FDCWD, "/proc/777013/stat", O_RDONLY) = 7
15:47:30.101520 openat(AT_FDCWD, "/proc/777013/statm", O_RDONLY) = 7

# When run more than once, the first iteration appears to behave the same way
$ strace -ttf -e openat top -b -n 2 -p $(pidof yes)  2>&1 | grep "/proc/$(pidof yes)/stat"
15:49:30.959776 openat(AT_FDCWD, "/proc/777013/stat", O_RDONLY) = 6
15:49:30.959896 openat(AT_FDCWD, "/proc/777013/statm", O_RDONLY) = 6
15:49:31.113038 openat(AT_FDCWD, "/proc/777013/stat", O_RDONLY) = 7
15:49:31.113217 openat(AT_FDCWD, "/proc/777013/statm", O_RDONLY) = 7
# After a pause (about 3 s, top's default interval), the following is printed
15:49:34.122824 openat(AT_FDCWD, "/proc/777013/stat", O_RDONLY) = 8
15:49:34.123518 openat(AT_FDCWD, "/proc/777013/statm", O_RDONLY) = 8
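To read the gaps straight off the strace output instead of subtracting timestamps by hand, a small awk filter can help. This is a sketch: stat_gaps is a hypothetical name, it assumes `strace -tt` timestamps (HH:MM:SS.microseconds), and it ignores wraparound at midnight. Fed the -n 2 output above, it should print 0.153 and 3.010, i.e. the quick double read at startup followed by top's default 3-second interval.

```bash
#!/bin/bash
# stat_gaps: read `strace -tt` output on stdin and print the seconds between
# successive openat() calls on /proc/<pid>/stat (".../statm" lines are skipped).
stat_gaps() {
    awk '/\/stat"/ {
        ts = ""
        # Find the HH:MM:SS.usec field (with -f it may follow a "[pid N]" prefix)
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+\.[0-9]+$/) { ts = $i; break }
        if (ts == "") next
        split(ts, hms, ":")
        t = hms[1] * 3600 + hms[2] * 60 + hms[3]   # seconds since midnight
        if (seen) printf "%.3f\n", t - prev
        prev = t; seen = 1
    }'
}
```

Usage: `strace -ttf -e openat top -b -n 2 -p "$(pidof yes)" 2>&1 | stat_gaps`.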

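Company names, product names, and service names mentioned in this article are trademarks or registered trademarks of their respective owners.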
