Linux
Troubleshooting
ACPI

ACPI Error is periodically logged to syslog on an HP server

Environment

# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"

# dmidecode -s system-product-name
ProLiant ML30 Gen9

I wish Xenial were supported too...

Error details

The following gets written to /var/log/syslog over and over...

[ 1866.577934] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20140424/exfield-420)
[ 1866.577949] ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMM] (Node ffff88103f0451b8), AE_AML_BUFFER_LIMIT (20140424/psparse-536)
[ 1866.577971] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20140424/power_meter-338)
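To get a rough idea of how often the message shows up, a small helper like the one below can count the matching lines. This is only a sketch: `count_pmm_errors` is a name made up for illustration, and the pattern is simply copied from the third log line above.

```shell
#!/bin/sh
# count_pmm_errors is a hypothetical helper (not part of any package).
# $1 is the log file to scan, for example /var/log/syslog.
count_pmm_errors() {
    # grep -c prints the number of matching lines (0 when nothing matches).
    grep -c 'AE_AML_BUFFER_LIMIT, Evaluating _PMM' "$1"
}

# Example: count_pmm_errors /var/log/syslog
```

Running it before and after a fix makes it easy to see whether the count is still growing.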

TL;DR

Permanently prevent the acpi_power_meter kernel module from being loaded.

echo "blacklist acpi_power_meter" >> /etc/modprobe.d/hwmon.conf
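Running the `echo ... >>` line twice leaves a duplicate entry in the file. A minimal sketch of an idempotent variant is shown below; `add_blacklist` is a hypothetical helper name, and the file path is passed in as an argument so it can be tried on a scratch file first.

```shell
#!/bin/sh
# add_blacklist is a hypothetical helper: it appends "blacklist <module>"
# to the given modprobe.d file only if that exact line is not there yet.
# $1 = module name, $2 = config file path
add_blacklist() {
    module=$1
    conf=$2
    line="blacklist $module"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Example (as root): add_blacklist acpi_power_meter /etc/modprobe.d/hwmon.conf
```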

Frequent ACPI errors in dmesg ring buffer · Issue #827 · firehol/netdata

Investigation log

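Searching for the error message turned up the blog post below, so I ran the commands as described there.
(It was filed under RHEL/CentOS, but since this looks like an ACPI/kernel-level problem, the distribution should not make a difference.)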

KERNEL ACPI ERROR SMBUS/IPMI/GENERICSERIALBUS

I also confirmed that running cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average causes the log entries in question to be written to /var/log/syslog.

# find /sys/devices/LNXSYSTM\:00/ |grep ACPI000D
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hid
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average_interval
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/name
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/path
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/control
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/async
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/runtime_enabled
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/runtime_active_kids
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/runtime_active_time
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/autosuspend_delay_ms
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/runtime_status
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/runtime_usage
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/power/runtime_suspended_time
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/device
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/subsystem
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/hwmon/hwmon0/uevent
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/control
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/async
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/runtime_enabled
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/runtime_active_kids
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/runtime_active_time
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/autosuspend_delay_ms
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/runtime_status
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/runtime_usage
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power/runtime_suspended_time
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_model_number
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/modalias
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average_interval_max
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average_interval_min
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/driver
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_oem_info
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/subsystem
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/status
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/uevent
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/physical_node
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_accuracy
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_serial_number
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/measures
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/measures/LNXSYBUS:00
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_is_battery

# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average
0
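The sysfs path above is specific to this machine. As a sketch, the attribute can also be located by name instead of hardcoding the path; `find_power_average` is a hypothetical helper, and the default search root is an assumption based on the listing above.

```shell
#!/bin/sh
# find_power_average is a hypothetical helper: it prints the path of every
# power1_average attribute found below the given directory.
# $1 = search root (defaults to the ACPI device tree seen on this server)
find_power_average() {
    root=${1:-/sys/devices/LNXSYSTM:00}
    find "$root" -name power1_average 2>/dev/null
}

# Example: reading each file found should trigger the ACPI Error on an
# affected machine.
#   for f in $(find_power_average); do cat "$f"; done
```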

# apt-get install -y lm-sensors        # the sensors command was missing, so install it

# sensors
power_meter-acpi-0
Adapter: ACPI interface

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +37.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +37.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +36.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +38.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +35.0°C  (high = +80.0°C, crit = +100.0°C)

tg3-pci-0200
Adapter: PCI adapter
temp1:         +0.1°C  (high =  +0.1°C, crit =  +0.1°C)

# vim /etc/sensors.d/hp-proliant-ml60.conf        # add the content below and save

# cat /etc/sensors.d/hp-proliant-ml60.conf 
chip "power_meter-acpi-0"
    ignore power1

# service lm-sensors restart

I thought this had fixed it, but after a reboot the same log messages started showing up again.

After digging further, the comment in the issue below resolved it.

https://github.com/firehol/netdata/issues/827#issuecomment-300494939

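$ sudo modprobe -r acpi_power_meter # temporary fix (until the next reboot)
$ echo "blacklist acpi_power_meter" | sudo tee -a /etc/modprobe.d/hwmon.conf # permanent fix; tee is used because with "sudo echo ... >>" the redirection runs as the unprivileged user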
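To check whether the module is really gone, both right after `modprobe -r` and after the next reboot, something like the following can be used. It is a minimal sketch: `module_loaded` is a hypothetical helper, and the optional second argument exists only so the function can be tried against a copy of the module list.

```shell
#!/bin/sh
# module_loaded is a hypothetical helper. /proc/modules lists one loaded
# module per line, with the module name as the first field.
# $1 = module name, $2 = module list file (defaults to /proc/modules)
# Exit status: 0 if the module is listed, 1 if it is not.
module_loaded() {
    module=$1
    list=${2:-/proc/modules}
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$list"
}

# Example:
#   if module_loaded acpi_power_meter; then
#       echo "acpi_power_meter is still loaded"
#   else
#       echo "acpi_power_meter is not loaded"
#   fi
```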

That's all.