Summer vacation came a little early, so let's do an independent research project (Day 2)

TL;DR

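The Linux kernel has a fault-injection facility. To enable it, you have to change the kernel config.
Several subsystems already implement it, and I believe it can be built into a custom device driver as well.
(The dilemma) Part of me thinks it is safer to remove it when shipping a product. Another part says that if you tested with it built in, then for the sake of test quality you should ship exactly that binary. I can't quite settle on one.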

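Introduction

When developing embedded devices, I often find myself wondering: "How am I supposed to assure, in software, the quality of the handling for trouble on the hardware side?"

One option is to rewrite the code so that it deliberately branches into the error path at compile time. This is probably the simplest way to test. The catch is that the development binary and the product binary then differ, which raises a hard, almost philosophical question: can you really say you tested the product?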
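To make the "rewrite it at compile time" approach concrete, here is a minimal sketch. Everything in it is hypothetical and not taken from any real driver: `hw_read_status()`, `read_sensor()` and the `FORCE_SENSOR_ERROR` macro are names I made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for reading a hardware status register. */
static int hw_read_status(void)
{
    return 0; /* 0 = the hardware reports no error */
}

/* Hypothetical driver entry point. */
int read_sensor(int *out)
{
    assert(out != NULL);
#ifdef FORCE_SENSOR_ERROR
    /* Test build only: behave as if the hardware had failed. */
    return -EIO;
#else
    if (hw_read_status() != 0)
        return -EIO;
    *out = 42; /* placeholder measurement */
    return 0;
#endif
}
```

Building with `-DFORCE_SENSOR_ERROR` exercises the caller's error handling, but the resulting object code is no longer the code that ships, which is exactly the concern described above.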

That reminded me that there is such a thing as fault injection, so I decided to look into it a bit more.

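What is fault injection?

First, the term: "fault injection" literally means injecting faults.

As the Wikipedia article on fault injection describes, there are broadly two kinds.

(A) Compile-time injection
(B) Runtime injection

The "rewrite the code so that it deliberately branches into the error path at compile time" approach mentioned earlier is exactly (A). So let's look more closely at (B).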
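For contrast with (A), a runtime injection point could look roughly like the sketch below. The names (`set_sensor_fault()`, `read_sensor_rt()`) are hypothetical, and the plain boolean flag is only an assumption standing in for whatever control interface a real system would expose.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical runtime knob; the same binary runs with it on or off. */
static bool inject_sensor_error;

void set_sensor_fault(bool enable)
{
    inject_sensor_error = enable;
}

/* Hypothetical driver entry point with a runtime injection hook. */
int read_sensor_rt(int *out)
{
    assert(out != NULL);
    if (inject_sensor_error)
        return -EIO; /* injected failure, decided at run time */
    *out = 42;       /* placeholder measurement */
    return 0;
}
```

The point of the design is that the tested binary and the shipped binary can be identical; the price is that the hook itself stays in the code.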

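The Linux kernel and fault injection

The kernel documentation also has a page on fault injection.

https://www.kernel.org/doc/html/latest/fault-injection/index.html

Fault injection in the Linux kernel was developed by Akinobu Mita of Fixstars. There are several introductory articles about it (and Mita's own material is perhaps the best organized and easiest to follow).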

Checking it against the actual mmc code

The kernel documentation says that faults can also be injected into mmc_request.

- fail_mmc_request

  injects MMC data errors on devices permitted by setting
  debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request

Let's dig deeper into this.
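As a rough idea of how those debugfs entries might be driven from a test program, here is a small userspace sketch. The knob names `probability`, `interval` and `times` come from the kernel's fault-injection documentation; the helper names `write_knob()` and `configure_mmc_faults()` and the chosen values are my own assumptions, and I have not run this against a real target.

```c
#include <assert.h>
#include <stdio.h>

/* Write one integer to <dir>/<knob>. Returns 0 on success, -1 on failure. */
int write_knob(const char *dir, const char *knob, long value)
{
    char path[256];
    FILE *fp;
    int n;

    assert(dir != NULL && knob != NULL);
    n = snprintf(path, sizeof(path), "%s/%s", dir, knob);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%ld\n", value) < 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Assumed example policy: 10% failure chance, every call, no limit. */
int configure_mmc_faults(const char *dir)
{
    if (write_knob(dir, "probability", 10) != 0)
        return -1;
    if (write_knob(dir, "interval", 1) != 0)
        return -1;
    return write_knob(dir, "times", -1);
}
```

On a kernel built with the option enabled and debugfs mounted, this would presumably be called as `configure_mmc_faults("/sys/kernel/debug/mmc0/fail_mmc_request")` with root privileges.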

Kernel config (FAIL_MMC_REQUEST)

config FAIL_MMC_REQUEST is defined in lib/Kconfig.debug.

config FAIL_MMC_REQUEST
    bool "Fault-injection capability for MMC IO"
    depends on FAULT_INJECTION_DEBUG_FS && MMC
    help
      Provide fault-injection capability for MMC IO.
      This will make the mmc core return data errors. This is
      useful to test the error handling in the mmc block device
      and to test how the mmc host driver handles retries from
      the block device.

In short: with this option enabled, the mmc core can be made to return data errors. That lets you exercise the error handling in the mmc block device, and see how the mmc host driver handles retries coming from the block device.
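Putting the dependencies together, a config fragment along these lines should enable the feature. Only `FAULT_INJECTION_DEBUG_FS` and `MMC` are stated as dependencies in the Kconfig entry quoted above; the remaining lines are prerequisites as I understand them from lib/Kconfig.debug, so treat them as an assumption and check them against your kernel version.

```
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_FS=y
CONFIG_FAULT_INJECTION=y
CONFIG_FAULT_INJECTION_DEBUG_FS=y
CONFIG_MMC=y
CONFIG_FAIL_MMC_REQUEST=y
```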

define the fault attributes

The fault attributes are defined in mmc/core/debugfs.c.

drivers/mmc/core/debugfs.c

#ifdef CONFIG_FAIL_MMC_REQUEST

static DECLARE_FAULT_ATTR(fail_default_attr);
static char *fail_request;
module_param(fail_request, charp, 0);

#endif /* CONFIG_FAIL_MMC_REQUEST */

Boot option support and adding the debugfs entries

Both the boot option support and the creation of the debugfs entries are done in mmc/core/debugfs.c.

drivers/mmc/core/debugfs.c

#ifdef CONFIG_FAIL_MMC_REQUEST
    if (fail_request)
        setup_fault_attr(&fail_default_attr, fail_request);
    host->fail_mmc_request = fail_default_attr;
    fault_create_debugfs_attr("fail_mmc_request", root,
                  &host->fail_mmc_request);
#endif
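According to the kernel documentation, the string passed via `mmc_core.fail_request=` has the form `<interval>,<probability>,<space>,<times>`. The following is a userspace sketch of how such a string could be parsed. It is not the kernel's `setup_fault_attr()`; `struct fault_spec` and `parse_fault_spec()` are names I made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for the four boot-option fields. */
struct fault_spec {
    unsigned long interval;
    unsigned long probability;
    int space;
    int times;
};

/* Parse "<interval>,<probability>,<space>,<times>"; false if malformed. */
bool parse_fault_spec(const char *str, struct fault_spec *out)
{
    assert(str != NULL && out != NULL);
    return sscanf(str, "%lu,%lu,%d,%d",
                  &out->interval, &out->probability,
                  &out->space, &out->times) == 4;
}
```

For example, `"1,10,0,-1"` would mean: consider every call, fail with 10% probability, no initial byte budget, and no limit on the number of failures.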

add a hook to insert failures

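When mmc/core/block.c issues a request to mmc/core/core.c through mmc_start_request() and CONFIG_FAIL_MMC_REQUEST is enabled, the request ultimately reaches should_fail() via mmc_should_fail_request(), which in this kernel version is called from mmc_request_done() when the request completes. If the config option is disabled, the hook is an empty inline function and does nothing. (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/mmc/core/core.c?h=v5.13.7#n76)

drivers/mmc/core/core.c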

#ifdef CONFIG_FAIL_MMC_REQUEST

/*
 * Internal function. Inject random data errors.
 * If mmc_data is NULL no errors are injected.
 */
static void mmc_should_fail_request(struct mmc_host *host,
                    struct mmc_request *mrq)
{
    struct mmc_command *cmd = mrq->cmd;
    struct mmc_data *data = mrq->data;
    static const int data_errors[] = {
        -ETIMEDOUT,
        -EILSEQ,
        -EIO,
    };

    if (!data)
        return;

    if ((cmd && cmd->error) || data->error ||
        !should_fail(&host->fail_mmc_request, data->blksz * data->blocks))
        return;

    data->error = data_errors[prandom_u32() % ARRAY_SIZE(data_errors)];
    data->bytes_xfered = (prandom_u32() % (data->bytes_xfered >> 9)) << 9;
}

#else /* CONFIG_FAIL_MMC_REQUEST */

static inline void mmc_should_fail_request(struct mmc_host *host,
                       struct mmc_request *mrq)
{
}

#endif /* CONFIG_FAIL_MMC_REQUEST */
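The last two lines of mmc_should_fail_request() are worth a second look: one picks an error code at random, the other pretends that only a random, 512-byte-aligned part of the data was transferred. Here is a userspace sketch of that arithmetic, with the random number passed in as `r` so the result is reproducible. The function names are mine, and the guard for transfers shorter than 512 bytes is an addition in this sketch, not something the quoted kernel code does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static const int data_errors[] = { -ETIMEDOUT, -EILSEQ, -EIO };

/* Select one of the three data errors from a random value r. */
int pick_error(unsigned int r)
{
    size_t n = sizeof(data_errors) / sizeof(data_errors[0]);

    return data_errors[r % n];
}

/* Shrink the transferred byte count to a random multiple of 512. */
unsigned int truncate_xfer(unsigned int bytes_xfered, unsigned int r)
{
    unsigned int blocks = bytes_xfered >> 9; /* whole 512-byte blocks */
    unsigned int result;

    if (blocks == 0)
        return 0; /* sketch-only guard against dividing by zero */

    result = (r % blocks) << 9;
    /* Always block-aligned and strictly less than the original count. */
    assert(result % 512 == 0 && result < bytes_xfered);
    return result;
}
```

With `bytes_xfered = 4096` and `r = 13`, there are 8 blocks, `13 % 8` is 5, and the reported transfer becomes 2560 bytes.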

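The body of should_fail() is in lib/fault-inject.c.

Note the meaning of the return value: false means "do not inject a failure" (the request proceeds normally), and true means "inject a failure".

The data size passed as the second argument of should_fail() is checked against attr->space. While space is larger than size, size is subtracted from space and no failure is injected. So it should probably be possible to, for example, let the first 120 MiB be processed normally and only start failing after that.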

/*
 * This code is stolen from failmalloc-1.0
 * http://www.nongnu.org/failmalloc/
 */

bool should_fail(struct fault_attr *attr, ssize_t size)
{
    if (in_task()) {
        unsigned int fail_nth = READ_ONCE(current->fail_nth);

        if (fail_nth) {
            fail_nth--;
            WRITE_ONCE(current->fail_nth, fail_nth);
            if (!fail_nth)
                goto fail;

            return false;
        }
    }

    /* No need to check any other properties if the probability is 0 */
    if (attr->probability == 0)
        return false;

    if (attr->task_filter && !fail_task(attr, current))
        return false;

    if (atomic_read(&attr->times) == 0)
        return false;

    if (atomic_read(&attr->space) > size) {
        atomic_sub(size, &attr->space);
        return false;
    }

    if (attr->interval > 1) {
        attr->count++;
        if (attr->count % attr->interval)
            return false;
    }

    if (attr->probability <= prandom_u32() % 100)
        return false;

    if (!fail_stacktrace(attr))
        return false;

fail:
    fail_dump(attr);

    if (atomic_read(&attr->times) != -1)
        atomic_dec_not_zero(&attr->times);

    return true;
}
EXPORT_SYMBOL_GPL(should_fail);
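To check my reading of the space, interval, probability and times gates, here is a simplified userspace model of the function above. It is only a sketch: it leaves out fail_nth, the task filter, the stacktrace filter and the atomics, and it takes the random value as a parameter (`roll`) in place of prandom_u32(). `struct fault_model` and `model_should_fail()` are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct fault_attr (assumed subset of fields). */
struct fault_model {
    unsigned long probability; /* percent, 0..100 */
    unsigned long interval;    /* consider every Nth call */
    long times;                /* failures left, -1 = unlimited */
    long space;                /* byte budget before failures start */
    unsigned long count;       /* calls seen by the interval gate */
};

bool model_should_fail(struct fault_model *m, long size, unsigned int roll)
{
    assert(m != NULL);

    if (m->probability == 0)
        return false;
    if (m->times == 0)
        return false;

    /* Budget gate: consume the budget and succeed while it lasts. */
    if (m->space > size) {
        m->space -= size;
        return false;
    }

    if (m->interval > 1) {
        m->count++;
        if (m->count % m->interval)
            return false;
    }

    if (m->probability <= roll % 100)
        return false;

    if (m->times != -1)
        m->times--;
    return true;
}
```

With `probability = 100`, `interval = 1`, `times = 1` and `space = 1024`, a first 512-byte request passes and leaves 512 bytes of budget, the second one fails because 512 is not greater than 512, and the third passes again because times has reached 0.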

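The dilemma at this point

A question comes to mind here. Using this feature for testing during development clearly makes a lot of sense. But what should be done with it at the product stage?

With compile-time fault injection, the injected code obviously cannot go into the product, so removing it is the only choice.
Runtime fault injection does no real harm when built into the product, as long as it is never enabled. Put the other way around, it does real harm if someone can enable it. So should it be removed or left in? Or is the real problem that a situation can arise in which it can be enabled at all...

It is quite a dilemma.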

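Summary

The Linux kernel has a fault-injection facility. To enable it, you have to change the kernel config.
Several subsystems already implement it, and I believe it can be built into a custom device driver as well.
(The dilemma) Part of me thinks it is safer to remove it when shipping a product. Another part says that if you tested with it built in, then for the sake of test quality you should ship exactly that binary. I can't quite settle on one.

That's all.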

Examining how fault injection works in the Linux kernel