
[CentOS 7] Notes on having smartd send an email when an HDD shows problems

More than 3 years have passed since last update.

Self-Monitoring, Analysis and Reporting Technology
https://ja.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology

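■ Overview

Periodically collect SMART data, the information used to diagnose HDD health, and configure the system to send an email to a specified address when a problem is detected.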

■ Setting up smartctl

▼ 1. Install smartctl

# yum list installed | grep smartmontools

If it is not installed, run the following command.

# yum install smartmontools

▼ 2. Register and start the service

# systemctl enable smartd.service
# systemctl start smartd.service

▼ 3. Check the HDDs

● List the HDDs that the server recognizes

# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device

● Check the details of each HDD in the list

# smartctl /dev/sda -i
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.22.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue (SATA 6Gb/s)
Device Model:     WDC WD10EZEX-XXXXXXX
Serial Number:    WD-XXXXXXXXXXXX
LU WWN Device Id: X XXXXXX XXXXXXXXX
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Aug  2 09:34:05 2016 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

If the drive supports SMART, the output contains lines like the following.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

If those lines are missing and SMART is not enabled, running the following can reportedly enable it in some cases.

# smartctl -s on /dev/<device>
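
The check described above can be scripted. The following is a minimal sketch, not part of the original procedure: `smart_support_state` is a hypothetical helper name, and it assumes the two "SMART support is:" lines look like the sample output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads `smartctl -i`
# output on stdin and prints one of: enabled, disabled, unsupported.
# Assumes the "SMART support is:" lines have the format shown above.
smart_support_state() {
    awk '
        /^SMART support is:/ {
            if ($0 ~ /: +Enabled/)   enabled = 1
            if ($0 ~ /: +Available/) available = 1
        }
        END {
            if (enabled)        print "enabled"
            else if (available) print "disabled"     # supported but turned off
            else                print "unsupported"  # no matching lines found
        }
    '
}
```

Usage would be something like `smartctl -i /dev/sda | smart_support_state`. If the result is "disabled", that is the case where `smartctl -s on` may help.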

● Check which test types the HDD supports and how long each takes

# smartctl /dev/sdb -c
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.22.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (10620) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 122) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x30b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

There are three test types: Short, Extended, and Conveyance.
The HDD in this example supports all of them.
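
To pull just the test names and their recommended polling times out of the `smartctl -c` output, something like the following sketch could be used. `selftest_minutes` is a hypothetical helper name, and the parsing assumes the two-line layout shown in the sample output above ("... self-test routine" followed by "recommended polling time: ( N) minutes.").

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -c` output on stdin and prints
# "<test name> <minutes>" for each self-test type the drive reports.
selftest_minutes() {
    awk '
        # Label line, e.g. "Short self-test routine"
        /self-test routine[ \t]*$/ { name = $1 }
        # Value line that follows the label
        /recommended polling time:/ && name != "" {
            line = $0
            gsub(/[^0-9]/, "", line)   # keep only the digits
            print name, line
            name = ""
        }
    '
}
```

Usage would be something like `smartctl -c /dev/sdb | selftest_minutes`.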

▼ 4. Run a scan

● Check the log before scanning

# smartctl -l selftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.22.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

● Run the scan

# smartctl -t short /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.22.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Tue Aug  2 11:09:14 2016

Use smartctl -X to abort test.

● Check the log after scanning

# smartctl -l selftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.22.2.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20894         -
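
If the result needs to be checked from a script rather than by eye, the newest log entry can be tested as in the sketch below. `latest_selftest_ok` is a hypothetical helper name, and it assumes the newest entry is the row starting with "# 1", as in the log above.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -l selftest` output on stdin.
# Returns 0 only if the newest entry ("# 1") reports
# "Completed without error"; returns 1 otherwise, including when
# no self-test has been logged yet.
latest_selftest_ok() {
    awk '
        /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/) }
        END { exit ((found && ok) ? 0 : 1) }
    '
}
```

Usage would be something like `smartctl -l selftest /dev/sda | latest_selftest_ok && echo OK`.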

▼ 5. Confirm that email can be sent

● Change the configuration to send a test email

# vi /etc/smartmontools/smartd.conf
smartd.conf
DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q

↓ Delete or comment out the line above (around line 23 of the file) and add a new line for testing

smartd.conf
#DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
DEVICESCAN -m example@example.com -M test

Note: only the first DEVICESCAN line in smartd.conf appears to take effect,
so adding a new line below the default one without commenting the default out does not work.
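
Since only the first uncommented DEVICESCAN line matters, it can help to confirm which line that is before reloading. A minimal sketch follows; `active_devicescan` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints "<line number>:<line>" for the first
# uncommented DEVICESCAN line in the given smartd.conf.
# $1: path to the config file
active_devicescan() {
    grep -n '^[[:space:]]*DEVICESCAN' "$1" | head -n 1
}
```

Usage would be something like `active_devicescan /etc/smartmontools/smartd.conf`. If the printed line is not the one just added, an earlier DEVICESCAN line is still active.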

● Reload the service; if the email arrives, the setup works

# systemctl reload smartd.service

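▼ 6. Configure the scan schedule

Set this up as appropriate while reading the comments in smartd.conf.
In the example below, -a enables the standard set of checks, -o on enables automatic offline data collection, -S on enables attribute autosave, -n standby,q skips checks while the disk is in standby, -s schedules the self-tests, -W 5,50,55 monitors temperature (report changes of 5 degrees, log at 50, warn at 55), and -m sets the mail recipient.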

smartd.conf
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././03|L/../../6/03) -W 5,50,55 -m example@example.com
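
The -s argument is a regular expression that smartd matches against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour). The example above therefore runs a short test every day at 03:00 and a long test every Saturday at 03:00. The sketch below is one way to try a pattern against a date string before putting it in the config; `schedule_matches` is a hypothetical helper name, and whole-string matching (`grep -x`) is an assumption about how smartd applies the pattern.

```shell
#!/bin/sh
# Hypothetical helper: checks whether a smartd -s pattern matches a
# "T/MM/DD/d/HH" string. Whole-string matching (-x) is an assumption.
# $1: schedule regex as written after -s
# $2: string to test, e.g. "S/08/02/2/03" (short test, Aug 2, Tuesday, 03h)
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}
```

Usage would be something like `schedule_matches '(S/../.././03|L/../../6/03)' 'L/08/06/6/03' && echo match`.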

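▼ 7. Checking the state when a failure occurred

At one point there was a report that /dev/sdb, which was mounted as a directory for file sharing,
could not be accessed or had become slow to access.
Note: an email like the following had been sent to root.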

Device: /dev/sdb [SAT], 120 Currently unreadable (pending) sectors

Run a scan and check the result.

# smartctl -t short /dev/sdb
# smartctl -l selftest /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.28.3.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     11638         1922090128
# 2  Short captive       Completed without error       00%         5         -

The Status is now "read failure".
Check the state in more detail with a command like the following (the output below is an excerpt).

# smartctl -a /dev/sdb | less
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   068   068   016    Pre-fail  Always       -       22090829
  2 Throughput_Performance  0x0005   142   142   054    Pre-fail  Offline      -       70
  3 Spin_Up_Time            0x0007   129   129   024    Pre-fail  Always       -       175 (Average 180)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       261
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1990
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   115   115   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       11639
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       261
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       587
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       587
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       23 (Min/Max 15/45)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2632
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       120
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       4

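This shows that Reallocated_Sector_Ct (bad sectors that have already been reallocated) is FAILING_NOW: its normalized VALUE (001) has fallen below THRESH (005).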
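
Rather than scanning the table by eye, the failing rows can be filtered out as in the sketch below. `failing_attributes` is a hypothetical helper name, and it assumes the ten-column layout shown above, where WHEN_FAILED is the ninth column and "-" means the attribute has never failed.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -a` (or -A) output on stdin and
# prints "ID NAME WHEN_FAILED RAW_VALUE" for every attribute row whose
# WHEN_FAILED column is not "-".
failing_attributes() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields
        $1 ~ /^[0-9]+$/ && NF >= 10 && $9 != "-" { print $1, $2, $9, $10 }
    '
}
```

Usage would be something like `smartctl -a /dev/sdb | failing_attributes`.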

■ Reference URLs

https://wiki.archlinuxjp.org/index.php/S.M.A.R.T.
https://www.smartmontools.org/browser/trunk/smartmontools/smartd.8.in

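Not everyone gets to use a server whose LEDs start blinking when a SMART problem is detected out of the box,
and exactly what smartd treats as a problem worth emailing about remains a mystery, but this is surely better than nothing.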

rhap