What Is S.M.A.R.T.? A Complete Guide to Storage Health Diagnostics

Posted at 2026-01-31

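Who this article is for

Engineers who want to check the health of their SSDs and HDDs
Infrastructure engineers responsible for storage monitoring in server operations
Anyone unsure what the numbers in "CrystalDiskInfo" mean
Every developer who wants to prevent data loss before it happens

What you will get from this article

How S.M.A.R.T. works: the technical background on how a drive can know its own health
How to read the important attributes: the "five indicators that really matter", based on Backblaze's research
HDD vs. SSD vs. NVMe: what to monitor for each device type
Practical monitoring: how to automate checks with Python and smartmontools, and how to set up alerts

What this article does not cover

Monitoring drives behind a RAID controller (requires separate consideration)
Details of vendor-specific attributes for particular manufacturers
Data recovery procedures (what to do after a failure)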

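1. How I Came Across S.M.A.R.T.

"The SSD that worked fine until yesterday suddenly stopped being recognized this morning."

This is the nightmare every engineer dreads. I once lost a week's worth of code when the SSD in my development machine died without warning. I had backups, but the changes I had made just before my latest commit could not be recovered.

What I started looking into afterwards, so that I would never go through that again, was S.M.A.R.T.

S.M.A.R.T. stands for Self-Monitoring, Analysis and Reporting Technology. It is a "health check system" built into storage devices. In human terms, it is like a health monitor that continuously measures blood pressure and heart rate and warns you when something is off.

There is one important caveat, though. S.M.A.R.T. can only detect "predictable failures". Sudden death caused by a surge from a lightning strike or power outage, or by physical shock, cannot be predicted. Even so, according to Google's research, "about 64% of failures were accompanied by some kind of S.M.A.R.T. warning". In other words, monitoring is well worth the effort.

By now you should have a rough picture of what S.M.A.R.T. is. Next, let's sort out the terms used in this article.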

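2. Prerequisites

Before getting to the main topic, let's go over the terms that appear in this article.

2.1 What is an attribute?

In S.M.A.R.T., an attribute is an item that represents one specific health indicator of the drive, such as temperature, power-on time, or error count. Each attribute has the following fields.

Field	Description
ID	Attribute identifier (1 to 255)
Current	Current normalized value (higher is better; the best value is usually 100, though the scale is vendor-dependent)
Worst	Worst normalized value recorded so far
Threshold	Threshold (the attribute counts as failed once the normalized value is at or below it)
Raw Value	Raw measurement (the actual number, e.g. degrees Celsius for temperature)
Type	Pre-fail (failure predictor) or Old-age (wear over time)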

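2.2 Pre-fail vs. Old-age attributes

S.M.A.R.T. attributes fall into two broad types.

Pre-fail attributes: reaching the threshold indicates that the drive is likely to fail within 24 hours. Urgency is high.

Old-age attributes: these reflect wear from normal use. Reaching the threshold does not mean the drive will fail immediately, but you should consider replacing it.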

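2.3 What is a threshold?

A threshold is the line set by the manufacturer that means "it is bad news once the value gets down to here". When the normalized current value falls to or below this threshold, the S.M.A.R.T. status becomes "FAILED" and the BIOS or OS displays a warning.

Note, however, that thresholds differ between manufacturers, and the interpretation of the raw value is also vendor-dependent. The same attribute ID can mean different things on a Seagate drive and a Western Digital drive.

With these terms in hand, let's look at the background of S.M.A.R.T.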
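
The relationship between Current, Worst, Threshold, and Type can be written down in a few lines. The following is only a minimal sketch: the `Attribute` container and the result labels are hypothetical names chosen for illustration, and it assumes the "at or below the threshold" rule and that a threshold of 0 can never trip.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One row of an ATA S.M.A.R.T. attribute table (hypothetical container)."""
    attr_id: int
    name: str
    current: int    # normalized value, higher is better
    worst: int      # worst normalized value seen so far
    threshold: int  # vendor-defined failure line
    prefail: bool   # True for Pre-fail, False for Old-age


def evaluate(attr: Attribute) -> str:
    """Classify one attribute by comparing normalized values to the threshold."""
    if attr.threshold == 0:
        # Assumption: a threshold of 0 means the attribute is informational only
        return "ok"
    if attr.current <= attr.threshold:
        # Pre-fail attributes signal imminent failure, Old-age ones signal wear
        return "failing_now" if attr.prefail else "replace_soon"
    if attr.worst <= attr.threshold:
        # The value has recovered, but it did cross the line at some point
        return "failed_in_past"
    return "ok"


if __name__ == "__main__":
    sample = Attribute(5, "Reallocated_Sector_Ct", 8, 8, 10, prefail=True)
    print(sample.name, evaluate(sample))
```

Because the normalized scale is vendor-dependent, this kind of check is best treated as a first filter, with the raw values examined separately.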

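3. The Background Behind S.M.A.R.T.

3.1 History

The origins of S.M.A.R.T. go back to 1992 and IBM's Predictive Failure Analysis (PFA). IBM first implemented drive self-diagnostics in the 9337 disk array for AS/400 servers.

PFA at the time was very simple: the only information it could send to the host was a binary "OK" or "likely to fail".

Then, in 1995, Compaq, Seagate, Quantum, and Conner drew up a specification called "IntelliSafe". This is the direct ancestor of today's S.M.A.R.T. IntelliSafe made it possible to send multiple attribute values to the host, which enabled more detailed monitoring.

In 1996, S.M.A.R.T. was standardized in the ATA-3 specification. This made it possible to retrieve S.M.A.R.T. data with the same commands regardless of manufacturer.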

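3.2 Why was S.M.A.R.T. needed?

In the 1990s, HDDs rapidly grew in capacity and speed. At the same time, the risk of data loss grew as well.

Year	Typical HDD capacity	Loss on failure
1990	40 MB	Several days of data
1995	1 GB	Several weeks of data
2000	20 GB	Several months of data
2010	1 TB	Several years of data

The larger the capacity, the more devastating the loss from a sudden failure. A mechanism that "warns before the failure happens" was badly needed.

HDDs are also precision machines, and wear over time is unavoidable. Problems that progress gradually, such as motor wear, head misalignment, and degradation of the platter surface, could potentially be detected through monitoring.

Now that the background is clear, let's look at the basic mechanism.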

4. Basic Concepts and Mechanism

4.1 S.M.A.R.T. architecture

S.M.A.R.T. works through cooperation between the firmware built into the drive and software on the host.

[S.M.A.R.T. operation flow]

┌─────────────────────────────────────────────────────┐
│                   Storage device                    │
│  ┌───────────────┐    ┌───────────────┐             │
│  │ Sensors       │───▶│ Firmware      │             │
│  │ - Temperature │    │ - Collection  │             │
│  │ - Error count │    │ - Attributes  │             │
│  │ - Power-on hrs│    │ - Thresholds  │             │
│  └───────────────┘    └───────┬───────┘             │
│                               │                     │
│                       ┌───────▼───────┐             │
│                       │ S.M.A.R.T.    │             │
│                       │ data area     │             │
│                       └───────┬───────┘             │
└───────────────────────────────┼─────────────────────┘
                                │ ATA/NVMe commands
                                ▼
┌─────────────────────────────────────────────────────┐
│                     Host system                     │
│  ┌───────────────┐    ┌───────────────┐             │
│  │ OS/BIOS       │───▶│ Monitoring    │             │
│  │ - smartctl    │    │ - Warnings    │             │
│  │ - nvme-cli    │    │ - Logging     │             │
│  └───────────────┘    └───────────────┘             │
└─────────────────────────────────────────────────────┘

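4.2 HDD-specific S.M.A.R.T. attributes

Because HDDs have mechanical moving parts, many of their S.M.A.R.T. attributes reflect mechanical health.

Attribute ID	Attribute name	Description	When to be concerned
5	Reallocated Sector Count	Number of bad sectors remapped to the spare area	Any non-zero value deserves attention
7	Seek Error Rate	Rate of head seek failures	A sharp increase is a sign of failure
10	Spin Retry Count	Number of retries needed to spin up the spindle	1 or more deserves attention
187	Reported Uncorrectable Errors	Number of uncorrectable errors reported	Any non-zero value is dangerous
188	Command Timeout	Number of command timeouts	An increase is a sign of failure
197	Current Pending Sector Count	Number of unreadable sectors waiting to be remapped	Any non-zero value deserves attention
198	Offline Uncorrectable	Number of bad sectors found during offline scans	Any non-zero value is dangerous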
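
Several rows in the table above are about change over time ("a sharp increase", "an increase") rather than a single reading. One way to catch that is to compare the raw values from two successive checks. This is a minimal sketch under assumptions: `WATCHED_IDS` and `raw_value_increases` are hypothetical names, the inputs are plain dictionaries mapping attribute ID to raw value, and ID 7 is left out because its raw value is not a simple counter on every vendor's drives.

```python
from typing import Dict, List

# Counter-like attribute IDs from the table above (assumed selection)
WATCHED_IDS = (5, 10, 187, 188, 197, 198)


def raw_value_increases(previous: Dict[int, int], current: Dict[int, int]) -> List[str]:
    """Report watched attributes whose raw value grew since the last check."""
    messages = []
    for attr_id in WATCHED_IDS:
        before = previous.get(attr_id, 0)
        now = current.get(attr_id, 0)
        if now > before:
            messages.append(f"ID {attr_id}: raw value rose from {before} to {now}")
    return messages


if __name__ == "__main__":
    last_week = {5: 0, 197: 2}
    today = {5: 3, 197: 2}
    for message in raw_value_increases(last_week, today):
        print(message)
```

Storing one snapshot per run (for example as a JSON file) would be enough to feed this comparison.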

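4.3 SSD-specific S.M.A.R.T. attributes

Because SSDs use NAND flash, the mechanical attributes do not apply. Attributes related to the lifetime of the flash memory become important instead.

Attribute ID	Attribute name	Description	When to be concerned
5	Reallocated NAND Block Count	Number of bad blocks replaced with spares	A sharp increase means end of life is near
177	Wear Leveling Count	Evenness of wear leveling	A drop means end of life is near
179	Used Reserved Block Count	Number of reserved blocks used	Dangerous when approaching the limit
180	Unused Reserved Block Count	Number of reserved blocks remaining	Dangerous when approaching 0
231	SSD Life Left / Temperature	Remaining life (%) or temperature	Replacement recommended at 10% or below
232	Available Reserved Space	Available reserved space	A drop means end of life is near
233	Media Wearout Indicator	NAND wear level	Replacement required when approaching 0
241	Total LBAs Written	Total amount written (in LBA units)	Watch the rated TBW limit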
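
To compare attribute 241 against a drive's rated TBW, the LBA count has to be converted to bytes first. The sketch below assumes the raw value counts 512-byte LBAs; some vendors report in other units, so check the drive's documentation. The function names and the 600 TBW rating in the example are illustrative, not taken from any datasheet.

```python
def lbas_to_terabytes(total_lbas: int, lba_size_bytes: int = 512) -> float:
    """Convert a Total LBAs Written raw value to decimal terabytes."""
    return total_lbas * lba_size_bytes / 1e12


def tbw_consumed_percent(total_lbas: int, rated_tbw: float, lba_size_bytes: int = 512) -> float:
    """Share of the rated TBW that has already been written, in percent."""
    return lbas_to_terabytes(total_lbas, lba_size_bytes) / rated_tbw * 100


if __name__ == "__main__":
    raw_value = 1_953_125_000 * 150   # hypothetical raw value of attribute 241
    print(f"written : {lbas_to_terabytes(raw_value):.1f} TB")
    print(f"consumed: {tbw_consumed_percent(raw_value, rated_tbw=600):.1f}% of rated TBW")
```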

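4.4 NVMe-specific S.M.A.R.T. log

Because NVMe uses a different protocol from ATA, the S.M.A.R.T. format differs too. The NVMe specification standardizes it as the SMART / Health Information Log (Log Page 02h).

Field	Description	When to be concerned
Critical Warning	Critical warning flags (bitmask)	Act immediately on any non-zero value
Temperature	Current temperature (reported in Kelvin)	70°C or above is dangerous
Available Spare	Remaining spare capacity (%)	Replacement recommended at 10% or below
Available Spare Threshold	Threshold for spare capacity (%)	-
Percentage Used	Estimated percentage of life used (%)	Replacement recommended once it reaches 100%
Data Units Read/Written	Amount of data read and written (one unit is 1,000 × 512 bytes)	Used to calculate TBW
Power On Hours	Power-on time	For reference
Unsafe Shutdowns	Number of unsafe shutdowns	Many of them can affect lifespan
Media Errors	Number of media errors	Any non-zero value deserves attention

A major advantage of NVMe is that these fields are standardized by the specification. Interpretation differs little between manufacturers, which makes monitoring easier.

Now that the basic concepts are clear, let's write some code and run it.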
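
Since these fields are standardized, decoding them takes very little code. The sketch below shows the three conversions that come up most often: expanding the Critical Warning bitmask, turning Kelvin into Celsius, and turning data units into bytes. The bit descriptions follow my reading of the NVMe base specification and should be checked against the revision your drives implement; the function names are hypothetical.

```python
from typing import List

# Bit positions of the Critical Warning field (assumed from the NVMe base spec)
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value: int) -> List[str]:
    """Return the description of every bit set in the Critical Warning byte."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]


def kelvin_to_celsius(kelvin: int) -> int:
    """Integer conversion, subtracting 273 as an approximation of 273.15."""
    return kelvin - 273


def data_units_to_bytes(units: int) -> int:
    """One data unit is 1,000 blocks of 512 bytes."""
    return units * 1000 * 512


if __name__ == "__main__":
    print(decode_critical_warning(0x05))
    print(kelvin_to_celsius(315), "C")
    print(data_units_to_bytes(30_664_062) / 1e12, "TB written")
```

Tools such as smartctl already apply the temperature conversion before printing, so these helpers matter mainly when reading the log page directly.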

5. Hands-On: Monitoring S.M.A.R.T.

5.1 Environment setup

# Linux (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install smartmontools nvme-cli

# Linux (RHEL/CentOS/Fedora)
sudo dnf install smartmontools nvme-cli

# macOS
brew install smartmontools

# Windows
# smartmontools: download the installer from https://www.smartmontools.org/wiki/Download
# or use winget
winget install smartmontools

5.2 Configuration files per environment

For development (config.yaml)

# config.yaml - for development (local PC)
environment: development

monitoring:
  interval_seconds: 300  # check every 5 minutes
  log_level: DEBUG

# Important attributes based on Backblaze research (for HDDs)
hdd_critical_attributes:
  - id: 5    # Reallocated Sector Count
    name: "Reallocated Sector Count"
    warning_threshold: 1
    critical_threshold: 10
  - id: 187  # Reported Uncorrectable Errors
    name: "Reported Uncorrectable Errors"
    warning_threshold: 1
    critical_threshold: 1
  - id: 188  # Command Timeout
    name: "Command Timeout"
    warning_threshold: 1
    critical_threshold: 10
  - id: 197  # Current Pending Sector Count
    name: "Current Pending Sector Count"
    warning_threshold: 1
    critical_threshold: 5
  - id: 198  # Offline Uncorrectable
    name: "Offline Uncorrectable"
    warning_threshold: 1
    critical_threshold: 1

# Attributes for SSDs
ssd_critical_attributes:
  - id: 177  # Wear Leveling Count
    name: "Wear Leveling Count"
    warning_threshold: 20  # warn at 20% remaining or below
    critical_threshold: 10
  - id: 231  # SSD Life Left
    name: "SSD Life Left"
    warning_threshold: 20
    critical_threshold: 10

# Attributes for NVMe
nvme_critical_attributes:
  - name: "percentage_used"
    warning_threshold: 80
    critical_threshold: 95
  - name: "available_spare"
    warning_threshold: 20
    critical_threshold: 10

temperature:
  warning_celsius: 55
  critical_celsius: 65

alerts:
  enabled: false
  
output:
  format: console
  verbose: true

For production (config.production.yaml)

# config.production.yaml - for production servers
environment: production

monitoring:
  interval_seconds: 3600  # check every hour
  log_level: INFO

hdd_critical_attributes:
  - id: 5
    name: "Reallocated Sector Count"
    warning_threshold: 1
    critical_threshold: 5
  - id: 187
    name: "Reported Uncorrectable Errors"
    warning_threshold: 1
    critical_threshold: 1
  - id: 188
    name: "Command Timeout"
    warning_threshold: 1
    critical_threshold: 5
  - id: 197
    name: "Current Pending Sector Count"
    warning_threshold: 1
    critical_threshold: 3
  - id: 198
    name: "Offline Uncorrectable"
    warning_threshold: 1
    critical_threshold: 1

ssd_critical_attributes:
  - id: 177
    name: "Wear Leveling Count"
    warning_threshold: 30
    critical_threshold: 15
  - id: 231
    name: "SSD Life Left"
    warning_threshold: 30
    critical_threshold: 15

nvme_critical_attributes:
  - name: "percentage_used"
    warning_threshold: 70
    critical_threshold: 90
  - name: "available_spare"
    warning_threshold: 30
    critical_threshold: 15

temperature:
  warning_celsius: 50
  critical_celsius: 60

alerts:
  enabled: true
  slack_webhook: ${SLACK_WEBHOOK_URL}
  pagerduty_key: ${PAGERDUTY_SERVICE_KEY}
  email: ${ALERT_EMAIL}
  
output:
  format: json
  file_path: /var/log/smart_monitor.log
  verbose: false
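
The `alerts` section above only names the destinations. As one possible way to act on `slack_webhook`, here is a minimal sketch that posts a message to a Slack incoming webhook using only the standard library. The function names and message layout are my own assumptions; the webhook URL is read from the `SLACK_WEBHOOK_URL` environment variable referenced in the config.

```python
import json
import os
import urllib.request
from typing import List, Optional


def build_slack_payload(device: str, status: str, warnings: List[str]) -> bytes:
    """Build the JSON body for a Slack incoming webhook (a single text field)."""
    lines = [f"[{status.upper()}] S.M.A.R.T. alert for {device}"]
    lines += [f"- {warning}" for warning in warnings]
    return json.dumps({"text": "\n".join(lines)}).encode("utf-8")


def send_slack_alert(device: str, status: str, warnings: List[str],
                     webhook_url: Optional[str] = None) -> bool:
    """Send the alert; return False when no webhook URL is configured."""
    url = webhook_url or os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        return False
    request = urllib.request.Request(
        url,
        data=build_slack_payload(device, status, warnings),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200


if __name__ == "__main__":
    sent = send_slack_alert("/dev/sda", "critical", ["Reallocated Sector Count raw value is 12"])
    print("alert sent" if sent else "alert not sent (no webhook configured)")
```

Keeping the payload construction separate from the network call makes the message format easy to test without contacting Slack.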

For testing (config.test.yaml)

# config.test.yaml - for CI/CD and unit tests
environment: test

monitoring:
  interval_seconds: 1
  log_level: DEBUG
  mock_enabled: true

hdd_critical_attributes:
  - id: 5
    name: "Reallocated Sector Count"
    warning_threshold: 1
    critical_threshold: 1
  - id: 197
    name: "Current Pending Sector Count"
    warning_threshold: 1
    critical_threshold: 1

ssd_critical_attributes:
  - id: 177
    name: "Wear Leveling Count"
    warning_threshold: 50
    critical_threshold: 30

nvme_critical_attributes:
  - name: "percentage_used"
    warning_threshold: 50
    critical_threshold: 80

temperature:
  warning_celsius: 40
  critical_celsius: 50

alerts:
  enabled: false
  
output:
  format: json
  file_path: /tmp/smart_monitor_test.log
  verbose: true
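
The three files differ mainly in their `warning_threshold` and `critical_threshold` values, so the evaluation logic can stay the same across environments. The sketch below shows one way to apply such a pair to a measured value. It assumes thresholds are inclusive and that the caller knows the direction: values like `percentage_used` and temperature get worse as they rise, while `available_spare` gets worse as it falls. The `classify` helper is a hypothetical name, and the dictionaries stand in for the parsed YAML.

```python
def classify(value: float, warning: float, critical: float,
             lower_is_worse: bool = False) -> str:
    """Map a measurement to 'good', 'warning', or 'critical'."""
    if lower_is_worse:
        if value <= critical:
            return "critical"
        if value <= warning:
            return "warning"
    else:
        if value >= critical:
            return "critical"
        if value >= warning:
            return "warning"
    return "good"


# Temperature thresholds copied from the development and production files
TEMPERATURE = {
    "development": {"warning_celsius": 55, "critical_celsius": 65},
    "production": {"warning_celsius": 50, "critical_celsius": 60},
}

if __name__ == "__main__":
    reading = 62
    for env, limits in TEMPERATURE.items():
        level = classify(reading, limits["warning_celsius"], limits["critical_celsius"])
        print(f"{env}: {reading} C is {level}")
```

With the same 62°C reading, the development thresholds give a warning while the stricter production thresholds give a critical result.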

5.3 S.M.A.R.T. monitoring script

#!/usr/bin/env python3
"""
S.M.A.R.T. monitoring script
Monitors the health of HDD/SSD/NVMe drives to predict failures

Usage (Linux/macOS - requires root):
    sudo python smart_monitor.py

Usage (Windows - run as administrator):
    python smart_monitor.py

Dependencies:
    - smartmontools must be installed
    - nvme-cli must be installed (for NVMe monitoring)
"""

import subprocess
import json
import sys
import re
from dataclasses import dataclass, field
from typing import Optional, Dict, List, Any
from datetime import datetime
from enum import Enum


class DeviceType(Enum):
    """Type of storage device"""
    HDD = "hdd"
    SSD = "ssd"
    NVME = "nvme"
    UNKNOWN = "unknown"


class HealthStatus(Enum):
    """Health status"""
    GOOD = "good"           # healthy
    WARNING = "warning"     # warning (keep an eye on it)
    CRITICAL = "critical"   # critical (act immediately)
    UNKNOWN = "unknown"     # unknown


@dataclass
class SMARTAttribute:
    """A single S.M.A.R.T. attribute"""
    id: int
    name: str
    current: int
    worst: int
    threshold: int
    raw_value: int
    attr_type: str  # "Pre-fail" or "Old_age"

    @property
    def is_failing(self) -> bool:
        """Whether the normalized value is at or below the threshold"""
        return self.current <= self.threshold and self.threshold > 0


@dataclass
class NVMEHealthInfo:
    """NVMe health information"""
    critical_warning: int
    temperature_celsius: int
    available_spare: int
    available_spare_threshold: int
    percentage_used: int
    data_units_read: int
    data_units_written: int
    power_on_hours: int
    unsafe_shutdowns: int
    media_errors: int


@dataclass
class DriveHealth:
    """Health of a drive"""
    device: str
    model: str
    serial: str
    device_type: DeviceType
    capacity_gb: float
    smart_status: str  # "PASSED" or "FAILED"
    health_status: HealthStatus
    temperature_celsius: int
    power_on_hours: int
    smart_attributes: List[SMARTAttribute] = field(default_factory=list)
    nvme_health: Optional[NVMEHealthInfo] = None
    warnings: List[str] = field(default_factory=list)
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


# Attribute IDs that matter for failure prediction, based on Backblaze research
BACKBLAZE_CRITICAL_ATTRIBUTES = {
    5: "Reallocated Sector Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable",
}


def run_command(cmd: List[str], timeout: int = 30) -> Optional[str]:
    """
    Run a command and return its output

    Args:
        cmd: the command and its arguments as a list
        timeout: timeout in seconds

    Returns:
        Standard output of the command, or None on failure
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout
        )
        return result.stdout
    except FileNotFoundError:
        return None
    except subprocess.TimeoutExpired:
        return None


def detect_device_type(device: str, smart_output: str) -> DeviceType:
    """
    Detect the device type

    Args:
        device: device path
        smart_output: output of smartctl

    Returns:
        DeviceType
    """
    if "/nvme" in device:
        return DeviceType.NVME

    if smart_output:
        lower_output = smart_output.lower()
        if "solid state device" in lower_output or "ssd" in lower_output:
            return DeviceType.SSD
        if "rotation rate" in lower_output:
            # A rotation rate entry that is not "solid state" means an HDD
            if "solid state" not in lower_output:
                return DeviceType.HDD

    return DeviceType.UNKNOWN


def parse_smart_attributes(output: str) -> List[SMARTAttribute]:
    """
    Parse S.M.A.R.T. attributes from smartctl output

    Args:
        output: output of smartctl -A

    Returns:
        List of SMARTAttribute
    """
    attributes = []

    # Look for rows of the attribute table
    # Format: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    pattern = r'^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+\S+\s+\S+\s+(\d+)'
    
    for line in output.split('\n'):
        match = re.match(pattern, line)
        if match:
            attr = SMARTAttribute(
                id=int(match.group(1)),
                name=match.group(2),
                current=int(match.group(3)),
                worst=int(match.group(4)),
                threshold=int(match.group(5)),
                attr_type=match.group(6),
                raw_value=int(match.group(7))
            )
            attributes.append(attr)
    
    return attributes


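def parse_nvme_health(device: str) -> Optional[NVMEHealthInfo]:
    """
    Get health information for an NVMe device

    Args:
        device: device path (e.g. /dev/nvme0)

    Returns:
        NVMEHealthInfo or None
    """
    # Get NVMe information with smartctl
    output = run_command(['smartctl', '-A', device])
    if not output:
        # Fall back to nvme-cli (its field labels are formatted differently
        # from smartctl's, so the patterns below may not match its output)
        output = run_command(['nvme', 'smart-log', device])

    if not output:
        return None

    # Dictionary for the parsed values
    data = {}

    # Parse smartctl-style output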
    patterns = {
        'critical_warning': r'Critical Warning:\s*0x([0-9a-fA-F]+)',
        'temperature': r'Temperature:\s*(\d+)',
        'available_spare': r'Available Spare:\s*(\d+)%?',
        'available_spare_threshold': r'Available Spare Threshold:\s*(\d+)%?',
        'percentage_used': r'Percentage Used:\s*(\d+)%?',
        'data_units_read': r'Data Units Read:\s*([\d,]+)',
        'data_units_written': r'Data Units Written:\s*([\d,]+)',
        'power_on_hours': r'Power On Hours:\s*([\d,]+)',
        'unsafe_shutdowns': r'Unsafe Shutdowns:\s*([\d,]+)',
        'media_errors': r'Media (?:and Data Integrity )?Errors:\s*(\d+)',
    }
    
    for key, pattern in patterns.items():
        match = re.search(pattern, output, re.IGNORECASE)
        if match:
            value = match.group(1).replace(',', '')
            if key == 'critical_warning':
                data[key] = int(value, 16)
            else:
                data[key] = int(value)
    
    if len(data) < 5:  # not enough data
        return None
    
    return NVMEHealthInfo(
        critical_warning=data.get('critical_warning', 0),
        temperature_celsius=data.get('temperature', 0),
        available_spare=data.get('available_spare', 100),
        available_spare_threshold=data.get('available_spare_threshold', 10),
        percentage_used=data.get('percentage_used', 0),
        data_units_read=data.get('data_units_read', 0),
        data_units_written=data.get('data_units_written', 0),
        power_on_hours=data.get('power_on_hours', 0),
        unsafe_shutdowns=data.get('unsafe_shutdowns', 0),
        media_errors=data.get('media_errors', 0)
    )


def analyze_health(
    device_type: DeviceType,
    smart_status: str,
    attributes: List[SMARTAttribute],
    nvme_health: Optional[NVMEHealthInfo],
    temperature: int
) -> tuple[HealthStatus, List[str]]:
    """
    Analyze the health status

    Args:
        device_type: device type
        smart_status: S.M.A.R.T. status
        attributes: list of S.M.A.R.T. attributes
        nvme_health: NVMe health information
        temperature: temperature

    Returns:
        (HealthStatus, list of warning messages)
    """
    warnings = []
    status = HealthStatus.GOOD

    # A FAILED S.M.A.R.T. status is CRITICAL right away
    if smart_status == "FAILED":
        return HealthStatus.CRITICAL, ["S.M.A.R.T. status is FAILED. Back up immediately and replace the drive."]

    # Temperature check
    if temperature > 65:
        warnings.append(f"Temperature is at a dangerous level: {temperature}°C > 65°C")
        status = HealthStatus.CRITICAL
    elif temperature > 55:
        warnings.append(f"Temperature is high: {temperature}°C > 55°C")
        if status != HealthStatus.CRITICAL:
            status = HealthStatus.WARNING

    # NVMe devices
    if device_type == DeviceType.NVME and nvme_health:
        if nvme_health.critical_warning != 0:
            warnings.append(f"NVMe Critical Warning flag is set: 0x{nvme_health.critical_warning:02x}")
            status = HealthStatus.CRITICAL

        if nvme_health.percentage_used >= 100:
            warnings.append(f"NVMe percentage used has reached 100%: {nvme_health.percentage_used}%")
            status = HealthStatus.CRITICAL
        elif nvme_health.percentage_used >= 80:
            warnings.append(f"NVMe percentage used is high: {nvme_health.percentage_used}%")
            if status != HealthStatus.CRITICAL:
                status = HealthStatus.WARNING

        if nvme_health.available_spare <= nvme_health.available_spare_threshold:
            warnings.append(f"NVMe available spare is at or below the threshold: {nvme_health.available_spare}% <= {nvme_health.available_spare_threshold}%")
            status = HealthStatus.CRITICAL
        elif nvme_health.available_spare <= 20:
            warnings.append(f"NVMe available spare is running low: {nvme_health.available_spare}%")
            if status != HealthStatus.CRITICAL:
                status = HealthStatus.WARNING

        if nvme_health.media_errors > 0:
            warnings.append(f"NVMe media errors detected: {nvme_health.media_errors}")
            if status != HealthStatus.CRITICAL:
                status = HealthStatus.WARNING

    # HDD/SSD devices (check the important attributes from Backblaze research)
    else:
        for attr in attributes:
            # Normalized value at or below the threshold
            if attr.is_failing:
                warnings.append(f"{attr.name}(ID:{attr.id}) is at or below its threshold: {attr.current} <= {attr.threshold}")
                status = HealthStatus.CRITICAL

            # Raw value check for the Backblaze attributes
            if attr.id in BACKBLAZE_CRITICAL_ATTRIBUTES and attr.raw_value > 0:
                attr_name = BACKBLAZE_CRITICAL_ATTRIBUTES[attr.id]
                if attr.raw_value >= 10:
                    warnings.append(f"{attr_name}(ID:{attr.id}) raw value is at a dangerous level: {attr.raw_value}")
                    status = HealthStatus.CRITICAL
                else:
                    warnings.append(f"{attr_name}(ID:{attr.id}) raw value is non-zero: {attr.raw_value}")
                    if status != HealthStatus.CRITICAL:
                        status = HealthStatus.WARNING

    return status, warnings


def get_drive_health(device: str) -> Optional[DriveHealth]:
    """
    Get the health of a drive

    Args:
        device: device path

    Returns:
        DriveHealth or None
    """
    # Collect information with smartctl
    info_output = run_command(['smartctl', '-i', device])
    attr_output = run_command(['smartctl', '-A', device])
    health_output = run_command(['smartctl', '-H', device])

    if not info_output:
        print(f"Warning: could not get S.M.A.R.T. information from {device}")
        return None

    # Parse the basic information
    model_match = re.search(r'(?:Device Model|Model Number):\s*(.+)', info_output)
    serial_match = re.search(r'Serial Number:\s*(.+)', info_output)
    # NVMe drives report "Total NVM Capacity" or "Namespace 1 Size/Capacity"
    # instead of "User Capacity"
    capacity_match = re.search(
        r'(?:User Capacity|Total NVM Capacity|Namespace 1 Size/Capacity):\s*([\d,]+)',
        info_output
    )

    model = model_match.group(1).strip() if model_match else "Unknown"
    serial = serial_match.group(1).strip() if serial_match else "Unknown"
    capacity_bytes = int(capacity_match.group(1).replace(',', '')) if capacity_match else 0
    capacity_gb = capacity_bytes / (1024 ** 3)

    # Detect the device type
    device_type = detect_device_type(device, info_output + (attr_output or ""))

    # S.M.A.R.T. status: ATA and NVMe drives print PASSED or FAILED, SCSI drives
    # print "SMART Health Status: OK". Anything else (no output, permission
    # error) is reported as UNKNOWN instead of being treated as a failure.
    if health_output and ("PASSED" in health_output or "SMART Health Status: OK" in health_output):
        smart_status = "PASSED"
    elif health_output and "FAILED" in health_output:
        smart_status = "FAILED"
    else:
        smart_status = "UNKNOWN"

    # Parse the attributes
    attributes = parse_smart_attributes(attr_output) if attr_output else []

    # NVMe information
    nvme_health = None
    if device_type == DeviceType.NVME:
        nvme_health = parse_nvme_health(device)

    # Temperature
    temperature = 0
    if nvme_health:
        temperature = nvme_health.temperature_celsius
    else:
        for attr in attributes:
            if attr.id in [194, 190]:  # Temperature_Celsius, Airflow_Temperature_Cel
                temperature = attr.raw_value
                break

    # Power-on time
    power_on_hours = 0
    if nvme_health:
        power_on_hours = nvme_health.power_on_hours
    else:
        for attr in attributes:
            if attr.id == 9:  # Power_On_Hours
                power_on_hours = attr.raw_value
                break

    # Analyze the health status
    health_status, warnings = analyze_health(
        device_type, smart_status, attributes, nvme_health, temperature
    )
    
    return DriveHealth(
        device=device,
        model=model,
        serial=serial,
        device_type=device_type,
        capacity_gb=round(capacity_gb, 2),
        smart_status=smart_status,
        health_status=health_status,
        temperature_celsius=temperature,
        power_on_hours=power_on_hours,
        smart_attributes=attributes,
        nvme_health=nvme_health,
        warnings=warnings
    )


def scan_devices() -> List[str]:
    """
    Scan the system for storage devices

    Returns:
        List of device paths
    """
    output = run_command(['smartctl', '--scan'])
    if not output:
        return []

    devices = []
    for line in output.split('\n'):
        if line.strip():
            # Lines look like "/dev/sda -d scsi"
            parts = line.split()
            if parts and parts[0].startswith('/dev/'):
                devices.append(parts[0])
    
    return devices


def print_health_report(health: DriveHealth) -> None:
    """Print the health report"""
    status_emoji = {
        HealthStatus.GOOD: "[OK]",
        HealthStatus.WARNING: "[WARNING]",
        HealthStatus.CRITICAL: "[CRITICAL]",
        HealthStatus.UNKNOWN: "[?]"
    }

    print("=" * 70)
    print(f"{status_emoji[health.health_status]} S.M.A.R.T. health report - {health.timestamp}")
    print("=" * 70)
    print(f"Device        : {health.device}")
    print(f"Model         : {health.model}")
    print(f"Serial        : {health.serial}")
    print(f"Type          : {health.device_type.value.upper()}")
    print(f"Capacity      : {health.capacity_gb:.2f} GB")
    print("-" * 70)
    print(f"S.M.A.R.T.    : {health.smart_status}")
    print(f"Health        : {health.health_status.value.upper()}")
    print(f"Temperature   : {health.temperature_celsius}°C")
    print(f"Power-on time : {health.power_on_hours:,} hours")

    # NVMe-specific information
    if health.nvme_health:
        nvme = health.nvme_health
        print("-" * 70)
        print("[NVMe details]")
        print(f"  Life used        : {nvme.percentage_used}%")
        print(f"  Available spare  : {nvme.available_spare}%")
        # One data unit is 1,000 x 512 bytes; dividing by 1e12 gives decimal TB
        print(f"  Data written     : {nvme.data_units_written * 512 * 1000 / 1e12:.2f} TB")
        print(f"  Data read        : {nvme.data_units_read * 512 * 1000 / 1e12:.2f} TB")
        print(f"  Unsafe shutdowns : {nvme.unsafe_shutdowns}")
        print(f"  Media errors     : {nvme.media_errors}")

    # Backblaze attributes (HDD/SSD)
    if health.device_type in [DeviceType.HDD, DeviceType.SSD]:
        critical_attrs = [a for a in health.smart_attributes if a.id in BACKBLAZE_CRITICAL_ATTRIBUTES]
        if critical_attrs:
            print("-" * 70)
            print("[Attributes important for failure prediction (Backblaze research)]")
            print(f"{'ID':>4} | {'Attribute':<35} | {'Value':>6} | {'Raw':>10}")
            print("-" * 70)
            for attr in critical_attrs:
                raw_status = "[!]" if attr.raw_value > 0 else ""
                print(f"{attr.id:>4} | {attr.name:<35} | {attr.current:>6} | {attr.raw_value:>10} {raw_status}")

    # Warning messages
    if health.warnings:
        print("-" * 70)
        print("[Warnings]")
        for warning in health.warnings:
            print(f"  - {warning}")

    print("=" * 70)


def main():
    """Main routine"""
    print("S.M.A.R.T. monitoring tool v1.0")
    print("Monitors the health of storage devices\n")

    # Scan for devices
    devices = scan_devices()

    if not devices:
        print("Error: no storage devices were found")
        print("  - Check that smartmontools is installed")
        print("  - Run with root/administrator privileges")
        sys.exit(1)

    print(f"Detected devices: {devices}\n")

    # Get the health of each device
    all_healthy = True
    for device in devices:
        health = get_drive_health(device)
        if health:
            print_health_report(health)
            print()
            if health.health_status != HealthStatus.GOOD:
                all_healthy = False

    # Exit code
    sys.exit(0 if all_healthy else 1)


if __name__ == "__main__":
    main()
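
`parse_smart_attributes` above relies on a regular expression over smartctl's text table. As an alternative, smartmontools 7.0 and later can emit JSON with `--json` (for example `smartctl --json -A -H /dev/sda`), which avoids column matching altogether. The following is a sketch rather than a drop-in replacement: the sample document is hand-written and heavily trimmed, and the key names reflect my understanding of the JSON layout for ATA drives, so verify them against the output of your own version.

```python
import json
from typing import Any, Dict, List, Optional

# Hand-written, trimmed sample in the shape smartctl emits for ATA drives
SAMPLE_JSON = """
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {
    "table": [
      {"id": 5, "name": "Reallocated_Sector_Ct", "value": 100, "worst": 100,
       "thresh": 10, "raw": {"value": 0, "string": "0"}},
      {"id": 197, "name": "Current_Pending_Sector", "value": 100, "worst": 100,
       "thresh": 0, "raw": {"value": 3, "string": "3"}}
    ]
  }
}
"""


def parse_attributes_json(text: str) -> List[Dict[str, Any]]:
    """Extract the attribute table from smartctl JSON output."""
    data = json.loads(text)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    return [
        {
            "id": row["id"],
            "name": row["name"],
            "current": row["value"],
            "worst": row["worst"],
            "threshold": row["thresh"],
            "raw_value": row["raw"]["value"],
        }
        for row in table
    ]


def smart_passed(text: str) -> Optional[bool]:
    """Return the overall status, or None when the drive did not report one."""
    return json.loads(text).get("smart_status", {}).get("passed")


if __name__ == "__main__":
    for attr in parse_attributes_json(SAMPLE_JSON):
        print(attr["id"], attr["name"], attr["raw_value"])
    print("passed:", smart_passed(SAMPLE_JSON))
```

Returning None for a missing status keeps "no answer" distinct from "failed", which matters when a command was denied or the device is behind a controller.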

5.4 Sample output

Running the code above produces output like the following (it varies by environment):

$ sudo python smart_monitor.py
S.M.A.R.T. monitoring tool v1.0
Monitors the health of storage devices

Detected devices: ['/dev/sda', '/dev/nvme0n1']

======================================================================
[OK] S.M.A.R.T. health report - 2026-02-01T15:30:00
======================================================================
Device        : /dev/sda
Model         : WDC WD40EFRX-68N32N0
Serial        : WD-WXXXXXXXXXXXX
Type          : HDD
Capacity      : 3725.90 GB
----------------------------------------------------------------------
S.M.A.R.T.    : PASSED
Health        : GOOD
Temperature   : 35°C
Power-on time : 12,543 hours
----------------------------------------------------------------------
[Attributes important for failure prediction (Backblaze research)]
  ID | Attribute                           |  Value |        Raw
----------------------------------------------------------------------
   5 | Reallocated_Sector_Ct               |    100 |          0
 187 | Reported_Uncorrect                  |    100 |          0
 188 | Command_Timeout                     |    100 |          0
 197 | Current_Pending_Sector              |    100 |          0
 198 | Offline_Uncorrectable               |    100 |          0
======================================================================

======================================================================
[OK] S.M.A.R.T. health report - 2026-02-01T15:30:00
======================================================================
Device        : /dev/nvme0n1
Model         : Samsung SSD 980 PRO 1TB
Serial        : S5XXXXXXXXXXXX
Type          : NVME
Capacity      : 931.51 GB
----------------------------------------------------------------------
S.M.A.R.T.    : PASSED
Health        : GOOD
Temperature   : 42°C
Power-on time : 2,543 hours
----------------------------------------------------------------------
[NVMe details]
  Life used        : 3%
  Available spare  : 100%
  Data written     : 15.70 TB
  Data read        : 8.50 TB
  Unsafe shutdowns : 5
  Media errors     : 0
======================================================================

5.5 Common errors and how to fix them

Error	Cause	Fix
`smartctl: command not found`	smartmontools is not installed	`apt install smartmontools` or `brew install smartmontools`
`Permission denied`	Insufficient privileges	Run with `sudo` (Linux/macOS) or as administrator (Windows)
`Unable to detect device type`	Automatic device type detection failed	Specify the type with the `-d` option, e.g. `smartctl -d sat /dev/sdX`
`SMART support is: Unavailable`	S.M.A.R.T. is unsupported or disabled	Enable S.M.A.R.T. in the BIOS, or the drive may be behind a RAID controller
`Read Device Identity failed`	The NVMe device could not be identified	Install `nvme-cli` and check directly with `nvme smart-log /dev/nvme0`
`Smartctl open device failed`	The device is busy	Wait a moment and retry, or check whether another process is using the device
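
Besides the messages in the table, smartctl also reports problems through its exit status, which is a bitmask rather than a simple success/failure code. Decoding it can narrow down an error before reading any output. The sketch below paraphrases the bit meanings from the smartctl manual page; `decode_exit_status` is a hypothetical helper name.

```python
from typing import List

# Exit status bits of smartctl, paraphrased from its manual page
EXIT_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or ATA command failed, or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes are at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records",
    7: "device self-test log contains errors",
}


def decode_exit_status(returncode: int) -> List[str]:
    """Translate a smartctl return code into human-readable findings."""
    return [text for bit, text in EXIT_STATUS_BITS.items() if returncode & (1 << bit)]


if __name__ == "__main__":
    # 0b01001000 = bits 3 and 6 set
    for finding in decode_exit_status(0b01001000):
        print("-", finding)
```

In a script, the value to pass in would be the `returncode` attribute of the object returned by `subprocess.run`.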

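5.6 Environment diagnostic script

If you run into problems, the following script can diagnose your environment:

#!/usr/bin/env python3
"""
Environment diagnostic script
Checks the runtime environment for the S.M.A.R.T. monitoring tool

Usage: python check_env.py
"""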

import subprocess
import sys
import platform
import os


def check_environment():
    """Check the environment and report problems"""
    issues = []
    info = []

    # OS information
    os_info = f"{platform.system()} {platform.release()}"
    info.append(f"OS: {os_info}")

    # Python version check
    py_version = sys.version_info
    info.append(f"Python: {py_version.major}.{py_version.minor}.{py_version.micro}")

    # smart_monitor.py uses the built-in generic `tuple[...]` in an annotation,
    # which needs Python 3.9 or later
    if py_version < (3, 9):
        issues.append(f"Python 3.9 or later is required (current: {sys.version})")

    # Check smartctl
    try:
        result = subprocess.run(
            ['smartctl', '--version'],
            capture_output=True,
            text=True,
            timeout=5
        )
        version_line = result.stdout.split('\n')[0]
        info.append(f"smartctl: {version_line}")
    except FileNotFoundError:
        issues.append("smartctl is not installed")
        issues.append("  Linux (Debian): sudo apt-get install smartmontools")
        issues.append("  Linux (RHEL): sudo dnf install smartmontools")
        issues.append("  macOS: brew install smartmontools")
        issues.append("  Windows: https://www.smartmontools.org/wiki/Download")
    except subprocess.TimeoutExpired:
        issues.append("smartctl timed out")

    # Check nvme-cli (optional)
    try:
        result = subprocess.run(
            ['nvme', 'version'],
            capture_output=True,
            text=True,
            timeout=5
        )
        version_line = result.stdout.strip()
        info.append(f"nvme-cli: {version_line}")
    except FileNotFoundError:
        info.append("nvme-cli: not installed (recommended for NVMe monitoring)")
    except subprocess.TimeoutExpired:
        pass

    # Privilege check (Unix-like systems only)
    if platform.system() != 'Windows':
        if os.geteuid() != 0:
            issues.append("root privileges are required (use sudo)")

    # Test the device scan
    try:
        result = subprocess.run(
            ['smartctl', '--scan'],
            capture_output=True,
            text=True,
            timeout=10
        )
        devices = [line.split()[0] for line in result.stdout.split('\n') if line.strip() and '/dev/' in line]
        info.append(f"Devices detected: {len(devices)}")
        if not devices:
            issues.append("no storage devices were detected")
    except Exception as e:
        issues.append(f"device scan failed: {e}")

    # Print the results
    print("=== S.M.A.R.T. monitoring environment check ===\n")

    print("[System information]")
    for item in info:
        print(f"  {item}")

    print()

    if issues:
        print("[Problems found]")
        for issue in issues:
            print(f"  - {issue}")
        return False
    else:
        print("[Result] The environment looks fine")
        return True


if __name__ == "__main__":
    success = check_environment()
    sys.exit(0 if success else 1)

5.7 Docker setup (optional)

To run S.M.A.R.T. monitoring in a container environment:

# Dockerfile
FROM python:3.11-slim

# Install smartmontools and nvme-cli
RUN apt-get update && \
    apt-get install -y smartmontools nvme-cli && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY smart_monitor.py .
COPY config.yaml .

CMD ["python", "smart_monitor.py"]

# docker-compose.yml
version: '3.8'
services:
  smart-monitor:
    build: .
    privileged: true  # required for device access
    volumes:
      - /dev:/dev:ro
    environment:
      - TZ=Asia/Tokyo
    # To run periodically
    # entrypoint: ["sh", "-c", "while true; do python smart_monitor.py; sleep 3600; done"]

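Now that we know how to implement it, let's look at concrete use cases.

6. Use-Case Guide

6.1 Use case 1: Regular health checks for a personal PC

Intended readers: individual developers, PC users

Recommended setup: a manual check once a week, plus CrystalDiskInfo running in the background

A personal PC does not need a large-scale monitoring system. Simply checking the key indicators is enough.

Sample code (quick check script):

#!/usr/bin/env python3
"""
For personal PCs: quick S.M.A.R.T. health check
Run it once a week; it prints OK when nothing is wrong

Usage:
    sudo python quick_check.py
"""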

import subprocess
import sys
import re


def quick_health_check():
    """
    簡易健康チェック
    重要な指標だけを確認し、問題があれば警告
    """
    # デバイスをスキャン
    scan_result = subprocess.run(
        ['smartctl', '--scan'],
        capture_output=True,
        text=True
    )
    
    devices = []
    for line in scan_result.stdout.split('\n'):
        if line.strip() and '/dev/' in line:
            devices.append(line.split()[0])
    
    if not devices:
        print("No devices were found")
        return False
    
    all_ok = True
    
    for device in devices:
        print(f"\n--- {device} ---")
        
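        # Check the overall S.M.A.R.T. status
        health_result = subprocess.run(
            ['smartctl', '-H', device],
            capture_output=True,
            text=True
        )

        # ATA and NVMe drives report "PASSED"; SCSI/SAS drives report
        # "SMART Health Status: OK" instead
        if "PASSED" in health_result.stdout or "SMART Health Status: OK" in health_result.stdout:
            print("  S.M.A.R.T. status: OK")
        else:
            print("  S.M.A.R.T. status: NG (!)")
            all_ok = False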
        
        # Fetch the attribute table
        info_result = subprocess.run(
            ['smartctl', '-A', device],
            capture_output=True,
            text=True
        )

        # Check the critical attributes (the five highlighted by Backblaze)
        critical_ids = [5, 187, 188, 197, 198]
        output = info_result.stdout

        problems = []
        for line in output.split('\n'):
            for cid in critical_ids:
                # Match an attribute row and capture its raw value (last column)
                pattern = rf'^\s*{cid}\s+\S+\s+0x[0-9a-f]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)'
                match = re.match(pattern, line)
                if match:
                    raw_value = int(match.group(1))
                    if raw_value > 0:
                        problems.append(f"ID:{cid} raw value={raw_value}")

        if problems:
            print(f"  Attributes with warnings: {', '.join(problems)}")
            all_ok = False
        else:
            print("  Critical attributes: all OK")
        
        # Extra check for NVMe devices
        if '/nvme' in device:
            # The `smartctl -A` output fetched above already contains the
            # NVMe health log, so there is no need to run smartctl again.
            match = re.search(r'Percentage Used:\s*(\d+)%?', output)
            if match:
                pct_used = int(match.group(1))
                if pct_used >= 80:
                    print(f"  NVMe life used: {pct_used}% (warning!)")
                    all_ok = False
                else:
                    print(f"  NVMe life used: {pct_used}% (OK)")
    
    print("\n" + "=" * 40)
    if all_ok:
        print("Result: all drives are healthy")
    else:
        print("Result: some drives need attention")
        print("  -> Take a backup and consider replacing them")
    
    return all_ok


if __name__ == "__main__":
    try:
        success = quick_health_check()
        sys.exit(0 if success else 1)
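    except PermissionError:
        # Raised only when smartctl itself cannot be executed. Running without
        # root usually shows up as a failed status in the output instead.
        print("Error: run this script with administrator privileges")
        print("  Linux/macOS: sudo python quick_check.py")
        sys.exit(1)
    except FileNotFoundError:
        print("Error: smartmontools is not installed")
        sys.exit(1)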
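
The attribute check in the script above hinges on a single regular expression, so it helps to see that parsing in isolation. The following is a minimal sketch, assuming the default ten-column table that `smartctl -A` prints for ATA drives. `parse_attribute_line` and `find_problems` are hypothetical helpers, and the sample rows are illustrative rather than captured from a real drive.

```python
import re
from typing import List, Optional, Tuple

# Columns of the default `smartctl -A` table for ATA drives:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTRIBUTE_LINE = re.compile(
    r'^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+\S+\s+(\d+)'
)

# The five attributes highlighted by Backblaze
CRITICAL_IDS = {5, 187, 188, 197, 198}


def parse_attribute_line(line: str) -> Optional[Tuple[int, str, int]]:
    """Return (id, name, raw_value) for an attribute row, or None otherwise."""
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))


def find_problems(smartctl_output: str) -> List[Tuple[int, str, int]]:
    """Collect critical attributes whose raw value is greater than zero."""
    problems = []
    for line in smartctl_output.splitlines():
        parsed = parse_attribute_line(line)
        if parsed and parsed[0] in CRITICAL_IDS and parsed[2] > 0:
            problems.append(parsed)
    return problems


# Illustrative sample, not output from a real drive
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21034
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(find_problems(SAMPLE))
```

Matching each line once and then filtering by ID avoids rebuilding a pattern per attribute, and keeping the parser separate from the `subprocess` call makes it testable without a real drive.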

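6.2 Use case 2: Integrating with a server monitoring system

Intended readers: SREs, infrastructure engineers

Recommended setup: Prometheus + node_exporter + Grafana

In server environments, it is most efficient to feed S.M.A.R.T. data into the monitoring system you already run.

Sample code (Prometheus exporter format):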

#!/usr/bin/env python3
"""
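For servers: expose S.M.A.R.T. metrics in the Prometheus exporter format
Runs as an HTTP server and serves metrics on the /metrics endpoint

Usage:
    sudo python smart_exporter.py
    # Fetch metrics from http://localhost:9100/metrics

Note:
    9100 is also node_exporter's default port. If node_exporter runs on the
    same host, change `port` in main() and the target below to a free port.

Example Prometheus configuration:
    scrape_configs:
      - job_name: 'smart'
        static_configs:
          - targets: ['localhost:9100']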
"""

import subprocess
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
from typing import List


def get_smart_metrics() -> List[str]:
    """
    Collect S.M.A.R.T. metrics in the Prometheus text format

    Returns:
        A list of metric lines
    """
    # Metric families as name -> (help text, type). Samples are collected per
    # family so that each family is written as one contiguous group, as the
    # text exposition format expects.
    families = {
        'smart_device_info': ('S.M.A.R.T. device information', 'gauge'),
        'smart_health_passed': ('S.M.A.R.T. health test passed (1=passed, 0=failed)', 'gauge'),
        'smart_temperature_celsius': ('Current drive temperature in Celsius', 'gauge'),
        'smart_power_on_hours': ('Total power on hours', 'counter'),
        'smart_attribute_raw_value': ('S.M.A.R.T. attribute raw value', 'gauge'),
        'smart_nvme_percentage_used': ('NVMe percentage used', 'gauge'),
        'smart_nvme_available_spare': ('NVMe available spare percentage', 'gauge'),
    }
    samples = {name: [] for name in families}

    def add(name: str, labels: str, value) -> None:
        """Append one sample line to its metric family."""
        samples[name].append(f'{name}{{{labels}}} {value}')

    # Scan for devices
    scan_result = subprocess.run(
        ['smartctl', '--scan', '-j'],
        capture_output=True,
        text=True
    )

    try:
        scan_data = json.loads(scan_result.stdout)
        devices = [d['name'] for d in scan_data.get('devices', [])]
    except (json.JSONDecodeError, KeyError):
        # Fallback: parse the plain-text scan output
        devices = []
        for line in scan_result.stdout.split('\n'):
            if line.strip() and '/dev/' in line:
                devices.append(line.split()[0])

    for device in devices:
        # Fetch detailed information as JSON
        info_result = subprocess.run(
            ['smartctl', '-a', '-j', device],
            capture_output=True,
            text=True
        )

        try:
            data = json.loads(info_result.stdout)
        except json.JSONDecodeError:
            continue

        # Device information
        model = data.get('model_name', 'unknown').replace('"', '\\"')
        serial = data.get('serial_number', 'unknown')
        device_type = data.get('device', {}).get('type', 'unknown')

        # Labels shared by every metric of this device
        labels = f'device="{device}",model="{model}",serial="{serial}",type="{device_type}"'

        add('smart_device_info', labels, 1)

        # Health status
        smart_status = data.get('smart_status', {})
        passed = 1 if smart_status.get('passed', False) else 0
        add('smart_health_passed', labels, passed)

        # Temperature
        temp = data.get('temperature', {}).get('current', 0)
        if temp > 0:
            add('smart_temperature_celsius', labels, temp)

        # ATA S.M.A.R.T. attributes
        ata_smart = data.get('ata_smart_attributes', {})
        for attr in ata_smart.get('table', []):
            attr_id = attr.get('id', 0)
            attr_name = attr.get('name', 'unknown')
            raw_value = attr.get('raw', {}).get('value', 0)

            attr_labels = f'{labels},attr_id="{attr_id}",attr_name="{attr_name}"'
            add('smart_attribute_raw_value', attr_labels, raw_value)

            # Power-on hours
            if attr_id == 9:
                add('smart_power_on_hours', labels, raw_value)

        # NVMe-specific metrics
        nvme_smart = data.get('nvme_smart_health_information_log', {})
        if nvme_smart:
            add('smart_nvme_percentage_used', labels, nvme_smart.get('percentage_used', 0))
            add('smart_nvme_available_spare', labels, nvme_smart.get('available_spare', 100))
            add('smart_power_on_hours', labels, nvme_smart.get('power_on_hours', 0))

    # Write HELP and TYPE followed by the samples, one family at a time
    metrics = []
    for name, (help_text, metric_type) in families.items():
        metrics.append(f'# HELP {name} {help_text}')
        metrics.append(f'# TYPE {name} {metric_type}')
        metrics.extend(samples[name])

    return metrics


class MetricsHandler(BaseHTTPRequestHandler):
    """HTTP handler that serves Prometheus metrics"""
    
    def do_GET(self):
        if self.path == '/metrics':
            metrics = get_smart_metrics()
            response = '\n'.join(metrics) + '\n'
            
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; charset=utf-8')
            self.end_headers()
            self.wfile.write(response.encode('utf-8'))
        else:
            self.send_response(404)
            self.end_headers()
    
    def log_message(self, format, *args):
        pass  # suppress per-request logging


def main():
    port = 9100
    server = HTTPServer(('0.0.0.0', port), MetricsHandler)
    print(f"S.M.A.R.T. Exporter started on port {port}")
    print(f"Metrics available at http://localhost:{port}/metrics")
    server.serve_forever()


if __name__ == "__main__":
    main()
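
Label values deserve a little care. In the Prometheus text exposition format, backslashes, double quotes, and line feeds inside a label value have to be escaped, and model strings reported by drives are not guaranteed to be free of them. Below is a minimal sketch of a formatter that handles this. `escape_label_value` and `format_sample` are hypothetical helpers, not part of the exporter above, and the model string is made up.

```python
from typing import Dict, Union


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and line feed for use in a label value."""
    return (
        value.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name: str, labels: Dict[str, str], value: Union[int, float]) -> str:
    """Render one sample line in the form: metric{key="value",...} number"""
    rendered = ",".join(
        f'{key}="{escape_label_value(str(val))}"' for key, val in labels.items()
    )
    return f"{name}{{{rendered}}} {value}"


if __name__ == "__main__":
    # The model string is invented to exercise the escaping
    print(format_sample(
        "smart_health_passed",
        {"device": "/dev/sda", "model": 'Example "Pro" 1TB'},
        1,
    ))
```

Passing labels as a dictionary rather than a pre-built string keeps the escaping in one place, so every label (model, serial, attribute name) gets the same treatment.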

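6.3 Use case 3: Preventive maintenance for large NAS / storage servers

Intended readers: data center operators, NAS administrators

Recommended setup: daily scans + weekly reports + threshold-based alerts

In environments that manage tens to hundreds of drives, a proactive replacement plan matters.

Sample code (batch report generation):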

#!/usr/bin/env python3
"""
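For large environments: batch S.M.A.R.T. report generation
Scans all drives and produces a CSV report and a replacement recommendation list

Usage:
    sudo python batch_report.py

Output:
    - smart_report_YYYYMMDD.csv: S.M.A.R.T. data for all drives
    - replacement_list_YYYYMMDD.txt: list of drives recommended for replacement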
"""

import subprocess
import json
import csv
import sys
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class DriveReport:
    """Report for a single drive"""
    device: str
    model: str
    serial: str
    capacity_gb: float
    device_type: str
    smart_status: str
    health_status: str
    temperature: int
    power_on_hours: int
    reallocated_sectors: int
    pending_sectors: int
    uncorrectable_errors: int
    nvme_percentage_used: int
    nvme_available_spare: int
    recommendation: str


def get_all_drives_report() -> List[DriveReport]:
    """
    Generate a report for every drive
    """
    reports = []

    # Scan for devices
    scan_result = subprocess.run(
        ['smartctl', '--scan'],
        capture_output=True,
        text=True
    )
    
    devices = []
    for line in scan_result.stdout.split('\n'):
        if line.strip() and '/dev/' in line:
            devices.append(line.split()[0])
    
    for device in devices:
        # Fetch detailed information as JSON
        info_result = subprocess.run(
            ['smartctl', '-a', '-j', device],
            capture_output=True,
            text=True
        )
        
        try:
            data = json.loads(info_result.stdout)
        except json.JSONDecodeError:
            continue
        
        # Basic information
        model = data.get('model_name', 'Unknown')
        serial = data.get('serial_number', 'Unknown')
        # Note: dividing by 1024**3 yields GiB
        capacity = data.get('user_capacity', {}).get('bytes', 0) / (1024**3)
        # smartctl's JSON output reports rotation_rate as 0 for solid-state media;
        # the phrase "Solid State Device" only appears in the plain-text output
        if '/nvme' in device:
            device_type = 'NVME'
        elif data.get('rotation_rate') == 0:
            device_type = 'SSD'
        else:
            device_type = 'HDD'

        # S.M.A.R.T. status
        smart_status = "PASSED" if data.get('smart_status', {}).get('passed', False) else "FAILED"

        # Temperature
        temperature = data.get('temperature', {}).get('current', 0)

        # Power-on hours
        power_on_hours = 0

        # ATA attributes
        reallocated = 0
        pending = 0
        uncorrectable = 0
        
        ata_smart = data.get('ata_smart_attributes', {})
        for attr in ata_smart.get('table', []):
            attr_id = attr.get('id', 0)
            raw_value = attr.get('raw', {}).get('value', 0)
            
            if attr_id == 5:
                reallocated = raw_value
            elif attr_id == 9:
                power_on_hours = raw_value
            elif attr_id == 197:
                pending = raw_value
            elif attr_id == 198:
                uncorrectable = raw_value
        
        # NVMe attributes
        nvme_pct_used = 0
        nvme_avail_spare = 100
        
        nvme_smart = data.get('nvme_smart_health_information_log', {})
        if nvme_smart:
            nvme_pct_used = nvme_smart.get('percentage_used', 0)
            nvme_avail_spare = nvme_smart.get('available_spare', 100)
            power_on_hours = nvme_smart.get('power_on_hours', power_on_hours)
        
        # Decide the health status and the recommendation
        health_status = "GOOD"
        recommendation = "OK to keep using"

        if smart_status == "FAILED":
            health_status = "CRITICAL"
            recommendation = "Replace immediately"
        elif device_type == "NVME":
            if nvme_pct_used >= 100 or nvme_avail_spare <= 10:
                health_status = "CRITICAL"
                recommendation = "Replace immediately"
            elif nvme_pct_used >= 80 or nvme_avail_spare <= 20:
                health_status = "WARNING"
                recommendation = "Plan replacement"
        else:
            if reallocated > 10 or uncorrectable > 0:
                health_status = "CRITICAL"
                recommendation = "Replace immediately"
            elif reallocated > 0 or pending > 0:
                health_status = "WARNING"
                recommendation = "Monitor closely"

        if temperature > 60:
            if health_status == "GOOD":
                health_status = "WARNING"
            recommendation += " / Improve cooling"
        
        reports.append(DriveReport(
            device=device,
            model=model,
            serial=serial,
            capacity_gb=round(capacity, 2),
            device_type=device_type,
            smart_status=smart_status,
            health_status=health_status,
            temperature=temperature,
            power_on_hours=power_on_hours,
            reallocated_sectors=reallocated,
            pending_sectors=pending,
            uncorrectable_errors=uncorrectable,
            nvme_percentage_used=nvme_pct_used,
            nvme_available_spare=nvme_avail_spare,
            recommendation=recommendation
        ))
    
    return reports


def main():
    print("S.M.A.R.T. batch report generator")
    print("=" * 50)

    # Generate the reports
    reports = get_all_drives_report()

    if not reports:
        print("Error: no drives were found")
        sys.exit(1)

    # Date string for the output file names
    date_str = datetime.now().strftime("%Y%m%d")

    # Write the CSV report
    csv_filename = f"smart_report_{date_str}.csv"
    with open(csv_filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(reports[0]).keys()))
        writer.writeheader()
        for report in reports:
            writer.writerow(asdict(report))
    print(f"CSV report written: {csv_filename}")
    
    # Replacement recommendation list
    replacement_needed = [r for r in reports if r.health_status in ["CRITICAL", "WARNING"]]

    txt_filename = f"replacement_list_{date_str}.txt"
    with open(txt_filename, 'w', encoding='utf-8') as f:
        f.write(f"S.M.A.R.T. drives recommended for replacement ({datetime.now().isoformat()})\n")
        f.write("=" * 60 + "\n\n")

        if replacement_needed:
            critical = [r for r in replacement_needed if r.health_status == "CRITICAL"]
            warning = [r for r in replacement_needed if r.health_status == "WARNING"]

            if critical:
                f.write("[Immediate replacement required]\n")
                for r in critical:
                    f.write(f"  {r.device}: {r.model} ({r.serial}) - {r.recommendation}\n")
                f.write("\n")

            if warning:
                f.write("[Replacement planning required]\n")
                for r in warning:
                    f.write(f"  {r.device}: {r.model} ({r.serial}) - {r.recommendation}\n")
        else:
            f.write("No drives need replacement.\n")

    print(f"Replacement list written: {txt_filename}")

    # Summary
    print("\n" + "=" * 50)
    print("[Summary]")
    print(f"  Total drives: {len(reports)}")
    print(f"  Good: {len([r for r in reports if r.health_status == 'GOOD'])}")
    print(f"  Warning: {len([r for r in reports if r.health_status == 'WARNING'])}")
    print(f"  Critical: {len([r for r in reports if r.health_status == 'CRITICAL'])}")


if __name__ == "__main__":
    main()
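
With daily scans, the change between two reports is often more telling than a single snapshot. The following is a rough sketch of comparing two CSV reports, assuming they use the column names written by the script above (`serial`, `reallocated_sectors`, `pending_sectors`, `uncorrectable_errors`). `load_report` and `find_growth` are hypothetical helpers, and the inline data stands in for two real report files.

```python
import csv
import io
from typing import Dict, List, TextIO

# Counters from the CSV report whose growth is worth flagging
TRACKED_COLUMNS = ("reallocated_sectors", "pending_sectors", "uncorrectable_errors")


def load_report(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Index the rows of one CSV report by drive serial number."""
    return {row["serial"]: row for row in csv.DictReader(handle)}


def find_growth(
    old: Dict[str, Dict[str, str]],
    new: Dict[str, Dict[str, str]],
) -> List[str]:
    """Describe every tracked counter that increased between two reports."""
    findings = []
    for serial, new_row in new.items():
        old_row = old.get(serial)
        if old_row is None:
            continue  # the drive was not present in the older report
        for column in TRACKED_COLUMNS:
            before, after = int(old_row[column]), int(new_row[column])
            if after > before:
                findings.append(f"{serial}: {column} {before} -> {after}")
    return findings


if __name__ == "__main__":
    # Inline data stands in for two smart_report_YYYYMMDD.csv files
    header = "serial,reallocated_sectors,pending_sectors,uncorrectable_errors\n"
    yesterday = load_report(io.StringIO(header + "SN001,0,0,0\nSN002,4,0,0\n"))
    today = load_report(io.StringIO(header + "SN001,0,0,0\nSN002,9,2,0\n"))
    for finding in find_growth(yesterday, today):
        print(finding)
```

Keying on the serial number rather than the device path is deliberate: device names such as /dev/sdb can change between boots, while the serial stays with the drive.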

With the use cases covered, let's look at where to go from here.

7. Learning roadmap

After reading this article, I recommend the following next steps.

For beginners (start here)

Check S.M.A.R.T. on your own PC
- Windows: install CrystalDiskInfo
- macOS: check the S.M.A.R.T. status under "Get Info" in Disk Utility
- Linux: run sudo smartctl -a /dev/sda
Learn the five key attributes
- Reallocated Sector Count (ID:5)
- Reported Uncorrectable Errors (ID:187)
- Command Timeout (ID:188)
- Current Pending Sector (ID:197)
- Offline Uncorrectable (ID:198)

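For intermediate users (moving on to practice)

Automate periodic monitoring
- Schedule a weekly check with cron or Task Scheduler
- Build a mechanism that sends an email notification when something is wrong
Integrate with your existing monitoring system
- Feed smartctl data into Prometheus, for example through node_exporter's textfile collector or the separate smartctl_exporter (node_exporter itself has no built-in smartctl collector)
- Build a dashboard in Grafana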
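
The email notification step can be sketched with the standard library alone. This is a minimal example, assuming an SMTP relay that accepts mail on localhost port 25 without authentication. `build_alert` and `send_alert` are hypothetical helpers, and the example.com addresses are placeholders.

```python
import smtplib
from email.message import EmailMessage
from typing import List


def build_alert(problems: List[str], sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text alert mail listing the detected problems."""
    message = EmailMessage()
    message["Subject"] = f"[S.M.A.R.T.] {len(problems)} problem(s) detected"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(problems))
    return message


def send_alert(message: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to an SMTP server (assumed to accept local relay)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(message)


if __name__ == "__main__":
    alert = build_alert(
        ["/dev/sda: ID:5 raw value=8"],
        sender="smart-monitor@example.com",
        recipient="admin@example.com",
    )
    # Printed here for inspection; call send_alert(alert) once a relay is available
    print(alert)
```

A check script run weekly from cron could call `build_alert` with its list of problems and send the mail only when that list is non-empty.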

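For advanced users (going deeper)

Analyze Backblaze's public dataset
- Download the Backblaze Drive Stats data
- Build a failure prediction model with machine learning
Dig into the NVMe specification
- Read the NVM Express Base Specification
- Understand every field of Log Page 02h (SMART / Health Information)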
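
As a first step before any machine learning, a simple count can show how strongly one attribute relates to failures. The sketch below assumes the Drive Stats CSV layout of one row per drive per day, with a `failure` column (0 or 1) and raw attribute columns such as `smart_5_raw`. Verify the column names against the files you download. `failure_counts` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from typing import Dict, TextIO


def failure_counts(handle: TextIO, column: str = "smart_5_raw") -> Dict[str, Dict[str, int]]:
    """
    Split drive-day rows by whether `column` is non-zero and count failures.

    Rows where the attribute is empty (not reported by the drive) are skipped.
    """
    counts = {
        "zero": {"rows": 0, "failures": 0},
        "nonzero": {"rows": 0, "failures": 0},
    }
    for row in csv.DictReader(handle):
        raw = row.get(column) or ""
        if not raw:
            continue  # attribute not reported for this drive-day
        group = "nonzero" if float(raw) > 0 else "zero"
        counts[group]["rows"] += 1
        counts[group]["failures"] += int(row["failure"])
    return counts


# Made-up rows in the assumed layout; the real files have many more columns
SAMPLE = """\
date,serial_number,model,capacity_bytes,failure,smart_5_raw
2024-01-01,SN-A,MODEL-X,1000,0,0
2024-01-01,SN-B,MODEL-X,1000,1,16
2024-01-01,SN-C,MODEL-X,1000,0,
2024-01-01,SN-D,MODEL-X,1000,0,8
"""

if __name__ == "__main__":
    for group, stats in failure_counts(io.StringIO(SAMPLE)).items():
        print(group, stats)
```

Comparing the failure share of the "nonzero" group with that of the "zero" group, across the five key attributes, gives a baseline that any prediction model should beat.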

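8. Summary

This article covered the following about S.M.A.R.T.:

S.M.A.R.T. basics: a self-diagnosis system built into storage devices that predicts failures
Key attributes: the five attributes highlighted by Backblaze's research, plus NVMe-specific indicators
HDD vs SSD vs NVMe: what to monitor differs by device type
Practical monitoring: automate with Python/smartmontools and integrate with Prometheus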

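My take

S.M.A.R.T. is not perfect. According to Google's research, 36% of failed drives gave no S.M.A.R.T. warning at all. Preventing sudden death 100% of the time is impossible.

Still, 64% of failures did show some sign beforehand, a share too large to ignore. In particular, the five attributes identified by Backblaze (Reallocated Sector Count, Reported Uncorrectable Errors, Command Timeout, Current Pending Sector, Offline Uncorrectable) are worth checking regularly.

With the spread of NVMe, the S.M.A.R.T. format is becoming more standardized, and differences in interpretation between manufacturers cause fewer headaches. That is good news for automated monitoring.

Finally, the most important point. S.M.A.R.T. monitoring is "insurance", not a "replacement for backups". No matter how good a monitoring system you build, nothing beats regular backups.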

References
