Details of ext4's nobarrier and journal_async_commit

Posted at 2017-02-23

Introduction

As announced in the previous article, I looked in detail at how nobarrier and journal_async_commit actually work.

Note that I am reading the freshly released Linux 4.10 or thereabouts.

Documentation

From kernel/Documentation/filesystems/ext4.txt:

185 barrier=<0|1(*)>        This enables/disables the use of write barriers in
186 barrier(*)              the jbd code.  barrier=0 disables, barrier=1 enables.
187 nobarrier               This also requires an IO stack which can support
188                         barriers, and if jbd gets an error on a barrier
189                         write, it will disable again with a warning.
190                         Write barriers enforce proper on-disk ordering
191                         of journal commits, making volatile disk write caches
192                         safe to use, at some performance penalty.  If
193                         your disks are battery-backed in one way or another,
194                         disabling barriers may safely improve performance.
195                         The mount options "barrier" and "nobarrier" can
196                         also be used to enable or disable barriers, for
197                         consistency with other ext4 mount options.

142 journal_async_commit    Commit block can be written to disk without waiting
143                         for descriptor blocks. If enabled older kernels cannot
144                         mount the device. This will enable 'journal_checksum'
145                         internally.

nobarrier

Accepted option spellings

nobarrier, barrier, or barrier=[0|1] all work.

Reading the code

From kernel/fs/ext4/super.c, we can see that barrier sets EXT4_MOUNT_BARRIER and nobarrier clears it.

super.c

1352         {Opt_barrier, "barrier=%u"},
1353         {Opt_barrier, "barrier"},
1354         {Opt_nobarrier, "nobarrier"},

super.c

1534         {Opt_barrier, EXT4_MOUNT_BARRIER, MOPT_SET},
1535         {Opt_nobarrier, EXT4_MOUNT_BARRIER, MOPT_CLEAR},

kernel/fs/ext4/super.c:ext4_init_journal_params() builds the options that are passed to jbd2. In other words, inside jbd2 we only need to follow JBD2_BARRIER.

super.c

4308         if (test_opt(sb, BARRIER))
4309                 journal->j_flags |= JBD2_BARRIER;
4310         else
4311                 journal->j_flags &= ~JBD2_BARRIER;
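The two tables and the flag copy above can be modelled in a few lines. This is only a sketch: the flag values, the struct, and the function names (apply_mount_opt, init_journal_flags) are made up for illustration, and the barrier=0/1 form is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel bits; the values are invented. */
#define MOUNT_BARRIER   0x1u  /* plays the role of EXT4_MOUNT_BARRIER */
#define JOURNAL_BARRIER 0x2u  /* plays the role of JBD2_BARRIER */

enum mopt_action { MOPT_SET, MOPT_CLEAR };

struct mount_opt {
    const char *name;
    unsigned int flag;
    enum mopt_action action;
};

/* Same idea as the two table rows quoted above. */
static const struct mount_opt opts[] = {
    { "barrier",   MOUNT_BARRIER, MOPT_SET },
    { "nobarrier", MOUNT_BARRIER, MOPT_CLEAR },
};

/* Apply one option string; returns 0 if recognised, -1 otherwise. */
int apply_mount_opt(unsigned int *mount_flags, const char *opt)
{
    for (size_t i = 0; i < sizeof(opts) / sizeof(opts[0]); i++) {
        if (strcmp(opts[i].name, opt) != 0)
            continue;
        if (opts[i].action == MOPT_SET)
            *mount_flags |= opts[i].flag;
        else
            *mount_flags &= ~opts[i].flag;
        return 0;
    }
    return -1;
}

/* Copy the mount bit into the journal flags, as the test_opt() branch does. */
unsigned int init_journal_flags(unsigned int mount_flags, unsigned int j_flags)
{
    if (mount_flags & MOUNT_BARRIER)
        j_flags |= JOURNAL_BARRIER;
    else
        j_flags &= ~JOURNAL_BARRIER;
    return j_flags;
}
```

Calling apply_mount_opt() with "nobarrier" and then init_journal_flags() leaves the journal bit cleared, which is the state the rest of the jbd2 code tests for.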

Reading the ext4 code

In kernel/fs/ext4/super.c:ext4_fill_super(), EXT4_MOUNT_BARRIER is set when the DEFM flags do not contain nobarrier. DEFM refers to the default mount options stored in the ext4 superblock.

super.c

3478         if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
3479                 set_opt(sb, BARRIER);

kernel/fs/ext4/super.c:ext4_load_journal() prints a KERN_INFO message. Put the other way around, you can tell from dmesg whether the filesystem was mounted with nobarrier.

super.c

4524         if (!(journal->j_flags & JBD2_BARRIER))
4525                 ext4_msg(sb, KERN_INFO, "barriers disabled");
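As a rough sketch of that dmesg check, the function below counts log lines containing the message text. The function name is hypothetical, and a line longer than the buffer could in principle split the match, so treat it as an illustration rather than a robust tool.

```c
#include <stdio.h>
#include <string.h>

/* Count lines in a kernel log stream that contain the ext4 message. */
int count_barriers_disabled(FILE *log)
{
    char line[1024];
    int hits = 0;

    while (fgets(line, sizeof(line), log) != NULL) {
        if (strstr(line, "barriers disabled") != NULL)
            hits++;
    }
    return hits;
}
```

It could be fed from a saved dmesg output opened with fopen(), or from stdin when the log is piped in.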

In kernel/fs/ext4/super.c:ext4_commit_super(), when the superblock is written synchronously, the write uses REQ_FUA if barriers are enabled and REQ_SYNC otherwise.

super.c

4616         if (sync) {
4617                 unlock_buffer(sbh);
4618                 error = __sync_dirty_buffer(sbh,
4619                         test_opt(sb, BARRIER) ? REQ_FUA : REQ_SYNC);
4620                 if (error)
4621                         return error;
4622 
4623                 error = buffer_write_io_error(sbh);
4624                 if (error) {
4625                         ext4_msg(sb, KERN_ERR, "I/O error while writing "
4626                                "superblock");
4627                         clear_buffer_write_io_error(sbh);
4628                         set_buffer_uptodate(sbh);
4629                 }
4630         }

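kernel/fs/ext4/super.c:ext4_sync_fs() checks whether blkdev_issue_flush() must be called (that is, whether REQ_OP_FLUSH must be sent) at the end of the sync. As the comment says, this presumably covers the case where inodes under writeback were written out but the journal had nothing to commit.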

super.c

4730         /*
4731          * Data writeback is possible w/o journal transaction, so barrier must
4732          * being sent at the end of the function. But we can skip it if
4733          * transaction_commit will do it for us.
4734          */
4735         if (sbi->s_journal) {
4736                 target = jbd2_get_latest_transaction(sbi->s_journal);
4737                 if (wait && sbi->s_journal->j_flags & JBD2_BARRIER &&
4738                     !jbd2_trans_will_send_data_barrier(sbi->s_journal, target))
4739                         needs_barrier = true;
4740 
4741                 if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
4742                         if (wait)
4743                                 ret = jbd2_log_wait_commit(sbi->s_journal,
4744                                                            target);
4745                 }
4746         } else if (wait && test_opt(sb, BARRIER))
4747                 needs_barrier = true;
4748         if (needs_barrier) {
4749                 int err;
4750                 err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
4751                 if (!ret)
4752                         ret = err;
4753         }
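The needs_barrier decision in that excerpt boils down to a small predicate. A minimal sketch, with a made-up function name and plain booleans standing in for the kernel state:

```c
#include <stdbool.h>

/*
 * has_journal       : sbi->s_journal is set
 * wait              : the caller asked for a synchronous sync
 * barrier           : JBD2_BARRIER (or the BARRIER mount option without a journal)
 * commit_will_flush : what jbd2_trans_will_send_data_barrier() would report
 */
bool sync_fs_needs_flush(bool has_journal, bool wait, bool barrier,
                         bool commit_will_flush)
{
    if (has_journal)
        return wait && barrier && !commit_will_flush;
    return wait && barrier;
}
```

With nobarrier the predicate is always false, so no explicit flush is issued from this path.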

kernel/fs/ext4/fsync.c:ext4_sync_file() does roughly the same thing.

fsync.c

115         if (!journal) {
116                 ret = __generic_file_fsync(file, start, end, datasync);
117                 if (!ret)
118                         ret = ext4_sync_parent(inode);
119                 if (test_opt(inode->i_sb, BARRIER))
120                         goto issue_flush;
121                 goto out;
122         }
123 
124         ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
125         if (ret)
126                 return ret;
127         /*
128          * data=writeback,ordered:
129          *  The caller's filemap_fdatawrite()/wait will sync the data.
130          *  Metadata is in the journal, we wait for proper transaction to
131          *  commit here.
132          *
133          * data=journal:
134          *  filemap_fdatawrite won't do anything (the buffers are clean).
135          *  ext4_force_commit will write the file data into the journal and
136          *  will wait on that.
137          *  filemap_fdatawait() will encounter a ton of newly-dirtied pages
138          *  (they were dirtied by commit).  But that's OK - the blocks are
139          *  safe in-journal, which is all fsync() needs to ensure.
140          */
141         if (ext4_should_journal_data(inode)) {
142                 ret = ext4_force_commit(inode->i_sb);
143                 goto out;
144         }
145 
146         commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
147         if (journal->j_flags & JBD2_BARRIER &&
148             !jbd2_trans_will_send_data_barrier(journal, commit_tid))
149                 needs_barrier = true;
150         ret = jbd2_complete_transaction(journal, commit_tid);
151         if (needs_barrier) {
152         issue_flush:
153                 err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
154                 if (!ret)
155                         ret = err;
156         }
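Seen from userland, this is the path an fsync(2) call on an ext4 file ends up in. The sketch below is an ordinary write-then-fsync helper; the name write_durably is made up, and EINTR handling is omitted for brevity. Whether the final step actually flushes the device cache depends on the barrier setting discussed above.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to path and ask the kernel to make them durable. */
int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    /* On ext4 this is where ext4_sync_file() runs. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```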

Reading the jbd2 code

From the ext4 code we know that JBD2_BARRIER is the flag to follow.

In kernel/fs/jbd2/journal.c:jbd2_trans_will_send_data_barrier(), when JBD2_BARRIER is not set the function reports that no REQ_OP_FLUSH is going to be sent.

journal.c

656         if (!(journal->j_flags & JBD2_BARRIER))
657                 return 0;

In kernel/fs/jbd2/journal.c:jbd2_write_superblock(), when JBD2_BARRIER is not set, REQ_FUA and REQ_PREFLUSH are cleared from the flags used to write the superblock.

journal.c

1330         if (!(journal->j_flags & JBD2_BARRIER))
1331                 write_flags &= ~(REQ_FUA | REQ_PREFLUSH);

In kernel/fs/jbd2/commit.c:journal_submit_commit_record(), which also involves journal_async_commit, the flag decides whether REQ_PREFLUSH and REQ_FUA are specified when the jbd2 commit block is written. As shown later, journal_async_commit also changes when journal_submit_commit_record() is called.

commit.c

156         if (journal->j_flags & JBD2_BARRIER &&
157             !jbd2_has_feature_async_commit(journal))
158                 ret = submit_bh(REQ_OP_WRITE,
159                         REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
160         else
161                 ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
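The flag selection for the commit block can be written as a tiny pure function. This is a sketch with invented bit values (the SK_ prefix marks them as stand-ins, not the kernel's definitions):

```c
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define SK_REQ_SYNC     0x1u
#define SK_REQ_FUA      0x2u
#define SK_REQ_PREFLUSH 0x4u

/* Flags used for the commit block write, following the branch above. */
unsigned int commit_record_flags(bool barrier, bool async_commit)
{
    if (barrier && !async_commit)
        return SK_REQ_SYNC | SK_REQ_PREFLUSH | SK_REQ_FUA;
    return SK_REQ_SYNC;
}
```

Only the combination of barriers on and async commit off gets the PREFLUSH and FUA treatment; every other combination writes the commit block as a plain synchronous write.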

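In kernel/fs/jbd2/commit.c:jbd2_journal_commit_transaction(), REQ_OP_FLUSH is sent to the filesystem device before the jbd2 commit block is written, provided the journal lives on a separate device and the transaction needs a data flush.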

commit.c

767         /* 
768          * If the journal is not located on the file system device,
769          * then we must flush the file system device before we issue
770          * the commit record
771          */
772         if (commit_transaction->t_need_data_flush &&
773             (journal->j_fs_dev != journal->j_dev) &&
774             (journal->j_flags & JBD2_BARRIER))
775                 blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS, NULL);

Also in kernel/fs/jbd2/commit.c:jbd2_journal_commit_transaction(), when both journal_async_commit and JBD2_BARRIER are in effect, REQ_OP_FLUSH is sent after the jbd2 commit block has been written.

commit.c

869         if (!jbd2_has_feature_async_commit(journal)) {
870                 err = journal_submit_commit_record(journal, commit_transaction,
871                                                 &cbh, crc32_sum);
872                 if (err)
873                         __jbd2_journal_abort_hard(journal);
874         }
875         if (cbh)
876                 err = journal_wait_on_commit_record(journal, cbh);
877         if (jbd2_has_feature_async_commit(journal) &&
878             journal->j_flags & JBD2_BARRIER) {
879                 blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);
880         }

kernel/fs/jbd2/recovery.c:jbd2_journal_recover() sends REQ_OP_FLUSH at the end of recovery (replay).

recovery.c

290         /* Make sure all replayed data is on permanent storage */
291         if (journal->j_flags & JBD2_BARRIER) {
292                 err2 = blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
293                 if (!err)
294                         err = err2;
295         }

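kernel/fs/jbd2/checkpoint.c:jbd2_cleanup_journal_tail() sends REQ_OP_FLUSH near its end, just before it updates the log tail. It seems to want to pin down the journal state from a high-level view, but the flush timing is already handled carefully elsewhere, so is this really needed? I do agree it matters in the error (abort) case described in the comment.

checkpoint.c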

369 /*
370  * Check the list of checkpoint transactions for the journal to see if
371  * we have already got rid of any since the last update of the log tail
372  * in the journal superblock.  If so, we can instantly roll the
373  * superblock forward to remove those transactions from the log.
374  *
375  * Return <0 on error, 0 on success, 1 if there was nothing to clean up.
376  *
377  * Called with the journal lock held.
378  *
379  * This is the only part of the journaling code which really needs to be
380  * aware of transaction aborts.  Checkpointing involves writing to the
381  * main filesystem area rather than to the journal, so it can proceed
382  * even in abort state, but we must not update the super block if
383  * checkpointing may have failed.  Otherwise, we would lose some metadata
384  * buffers which should be written-back to the filesystem.
385  */
386 
387 int jbd2_cleanup_journal_tail(journal_t *journal)
388 {
389         tid_t           first_tid;
390         unsigned long   blocknr;
391 
392         if (is_journal_aborted(journal))
393                 return -EIO;
394 
395         if (!jbd2_journal_get_log_tail(journal, &first_tid, &blocknr))
396                 return 1;
397         J_ASSERT(blocknr != 0);
398 
399         /*
400          * We need to make sure that any blocks that were recently written out
401          * --- perhaps by jbd2_log_do_checkpoint() --- are flushed out before
402          * we drop the transactions from the journal. It's unlikely this will
403          * be necessary, especially with an appropriately sized journal, but we
404          * need this to guarantee correctness.  Fortunately
405          * jbd2_cleanup_journal_tail() doesn't get called all that often.
406          */
407         if (journal->j_flags & JBD2_BARRIER)
408                 blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS, NULL);
409 
410         return __jbd2_update_log_tail(journal, first_tid, blocknr);
411 }

journal_async_commit

Reading the code

From kernel/fs/ext4/super.c, we can see that EXT4_MOUNT_JOURNAL_ASYNC_COMMIT gets set, together with EXT4_MOUNT_JOURNAL_CHECKSUM.

super.c

1333         {Opt_journal_async_commit, "journal_async_commit"},

super.c

1523         {Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
1524                                     EXT4_MOUNT_JOURNAL_CHECKSUM),

In kernel/fs/ext4/super.c:set_journal_csum_feature_set(), jbd2's JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT is enabled only when JOURNAL_ASYNC_COMMIT is set.

super.c

3134         if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
3135                 ret = jbd2_journal_set_features(sbi->s_journal,
3136                                 compat, 0,
3137                                 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
3138                                 incompat);
3139         } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
3140                 ret = jbd2_journal_set_features(sbi->s_journal,
3141                                 compat, 0,
3142                                 incompat);
3143                 jbd2_journal_clear_features(sbi->s_journal, 0, 0,
3144                                 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
3145         } else {
3146                 jbd2_journal_clear_features(sbi->s_journal, 0, 0,
3147                                 JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
3148         }

Reading the ext4 code

It appears to do nothing more than emit error messages.

In kernel/fs/ext4/super.c:ext4_fill_super(), journal_async_commit is rejected when there is no journal.

super.c

3927         if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
3928                 if (ext4_load_journal(sb, es, journal_devnum))
3929                         goto failed_mount3a;
3930         } else if (test_opt(sb, NOLOAD) && !(sb->s_flags & MS_RDONLY) &&
3931                    ext4_has_feature_journal_needs_recovery(sb)) {
3932                 ext4_msg(sb, KERN_ERR, "required journal recovery "
3933                        "suppressed and not mounted read-only");
3934                 goto failed_mount_wq;
3935         } else {
3936                 /* Nojournal mode, all journal mount options are illegal */
3937                 if (test_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM)) {
3938                         ext4_msg(sb, KERN_ERR, "can't mount with "
3939                                  "journal_checksum, fs mounted w/o journal");
3940                         goto failed_mount_wq;
3941                 }
3942                 if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
3943                         ext4_msg(sb, KERN_ERR, "can't mount with "
3944                                  "journal_async_commit, fs mounted w/o journal");
3945                         goto failed_mount_wq;
3946                 }

In kernel/fs/ext4/super.c:ext4_fill_super(), journal_async_commit is rejected with data=ordered.

super.c

4007         if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA &&
4008             test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
4009                 ext4_msg(sb, KERN_ERR, "can't mount with "
4010                         "journal_async_commit in data=ordered mode");
4011                 goto failed_mount_wq;
4012         }

In kernel/fs/ext4/super.c:ext4_remount(), journal_async_commit is likewise rejected with data=ordered.

super.c

4907         } else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA) {
4908                 if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
4909                         ext4_msg(sb, KERN_ERR, "can't mount with "
4910                                 "journal_async_commit in data=ordered mode");
4911                         err = -EINVAL;
4912                         goto restore_opts;
4913                 }
4914         }

Reading the jbd2 code

From kernel/include/linux/jbd2.h, we can see that jbd2_has_feature_async_commit is the thing to check.

jbd2.h

1114 JBD2_FEATURE_INCOMPAT_FUNCS(async_commit,       ASYNC_COMMIT)

In kernel/fs/jbd2/commit.c:journal_submit_commit_record(), which also involves JBD2_BARRIER, the feature decides whether REQ_PREFLUSH and REQ_FUA are specified when the jbd2 commit block is written.

commit.c

156         if (journal->j_flags & JBD2_BARRIER &&
157             !jbd2_has_feature_async_commit(journal))
158                 ret = submit_bh(REQ_OP_WRITE,
159                         REQ_SYNC | REQ_PREFLUSH | REQ_FUA, bh);
160         else
161                 ret = submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);

kernel/fs/jbd2/commit.c:jbd2_journal_commit_transaction() changes when journal_submit_commit_record() is called: either at the end of commit phase 3 or at the start of commit phase 5.

commit.c

777         /* Done it all: now write the commit record asynchronously. */
778         if (jbd2_has_feature_async_commit(journal)) {
779                 err = journal_submit_commit_record(journal, commit_transaction,
780                                                  &cbh, crc32_sum);
781                 if (err)
782                         __jbd2_journal_abort_hard(journal);
783         }

commit.c

869         if (!jbd2_has_feature_async_commit(journal)) {
870                 err = journal_submit_commit_record(journal, commit_transaction,
871                                                 &cbh, crc32_sum);
872                 if (err)
873                         __jbd2_journal_abort_hard(journal);
874         }
875         if (cbh)
876                 err = journal_wait_on_commit_record(journal, cbh);
877         if (jbd2_has_feature_async_commit(journal) &&
878             journal->j_flags & JBD2_BARRIER) {
879                 blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);
880         }
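To make the difference in ordering easier to see, here is a simplified model that spells out the sequence of steps as text. The step names and the function commit_order are made up, and the real function does far more than these few steps, so read it as a summary of the excerpts above and nothing else.

```c
#include <stdbool.h>
#include <stdio.h>

/* Write the simplified order of commit steps into out and return out. */
char *commit_order(bool async_commit, bool barrier, char *out, size_t outlen)
{
    /* With async commit the commit block is submitted before waiting. */
    const char *early = async_commit ? "submit_commit " : "";
    const char *late = "";
    const char *flush = "";

    /* Without async commit it is submitted after the log blocks are done. */
    if (!async_commit)
        late = barrier ? "submit_commit(preflush+fua) " : "submit_commit ";
    /* Async commit plus barriers ends with an explicit flush instead. */
    if (async_commit && barrier)
        flush = " flush_journal_dev";

    snprintf(out, outlen, "submit_log_blocks %swait_log_blocks %swait_commit%s",
             early, late, flush);
    return out;
}
```

Printing the result for each of the four combinations shows that journal_async_commit trades the PREFLUSH/FUA commit write for one cache flush after the commit block has completed.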

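kernel/fs/jbd2/recovery.c:do_one_pass() decides how to treat commit block corruption that can occur with journal_async_commit. The comment describes the cases involved. Still, "journal_async_commit was in effect when the power loss or I/O error happened" and "journal_async_commit is in effect while replaying" do not seem to be the same thing...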

recovery.c

658                 case JBD2_COMMIT_BLOCK:
659                         /*     How to differentiate between interrupted commit
660                          *               and journal corruption ?
661                          *
662                          * {nth transaction}
663                          *        Checksum Verification Failed
664                          *                       |
665                          *               ____________________
666                          *              |                    |
667                          *      async_commit             sync_commit
668                          *              |                    |
669                          *              | GO TO NEXT    "Journal Corruption"
670                          *              | TRANSACTION
671                          *              |
672                          * {(n+1)th transanction}
673                          *              |
674                          *       _______|______________
675                          *      |                     |
676                          * Commit block found   Commit block not found
677                          *      |                     |
678                          * "Journal Corruption"       |
679                          *               _____________|_________
680                          *              |                       |
681                          *      nth trans corrupt       OR   nth trans
682                          *      and (n+1)th interrupted     interrupted
683                          *      before commit block
684                          *      could reach the disk.
685                          *      (Cannot find the difference in above
686                          *       mentioned conditions. Hence assume
687                          *       "Interrupted Commit".)
688                          */
(----------snip----------)
735                                         if (!jbd2_has_feature_async_commit(journal)) {
736                                                 journal->j_failed_commit =
737                                                         next_commit_ID;
738                                                 brelse(bh);
739                                                 break;
740                                         }
(----------snip----------)
749                                 if (!jbd2_has_feature_async_commit(journal)) {
750                                         journal->j_failed_commit =
751                                                 next_commit_ID;
752                                         brelse(bh);
753                                         break;
754                                 }
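The decision tree drawn in that comment can be encoded directly. The sketch below follows the diagram only, not the body of do_one_pass(); the enum and function names are hypothetical.

```c
#include <stdbool.h>

enum commit_verdict {
    COMMIT_OK,           /* checksum verified */
    JOURNAL_CORRUPTION,  /* "Journal Corruption" in the diagram */
    INTERRUPTED_COMMIT,  /* assumed "Interrupted Commit" */
};

/*
 * checksum_failed   : checksum verification of the nth transaction failed
 * async_commit      : the async_commit feature is set on the journal
 * next_commit_found : a commit block exists for the (n+1)th transaction
 */
enum commit_verdict classify_commit(bool checksum_failed, bool async_commit,
                                    bool next_commit_found)
{
    if (!checksum_failed)
        return COMMIT_OK;
    if (!async_commit)
        return JOURNAL_CORRUPTION;
    return next_commit_found ? JOURNAL_CORRUPTION : INTERRUPTED_COMMIT;
}
```

The async_commit argument here is the feature bit seen at replay time, which is exactly the point questioned above: nothing in the decision records what was in effect when the failure happened.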

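REQ_OP_FLUSH, REQ_FUA, REQ_PREFLUSH

Brief explanation

Modern storage devices have a cache inside the hardware itself, so a write that looks complete from the outside has not necessarily reached non-volatile storage (the part that survives a power loss). These requests exist to control that cache explicitly.

REQ_OP_FLUSH and REQ_PREFLUSH mean: write out whatever is still sitting in the cache.

REQ_FUA means: write this request's data through to the media instead of leaving it only in the cache.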
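A toy model may help to picture these semantics. Everything below is invented for illustration (struct toy_disk, the toy_ functions, the block count); it models a device with a volatile write cache and is not how any real device or the block layer is implemented.

```c
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 8

struct toy_disk {
    int media[NBLOCKS];   /* non-volatile contents */
    int cache[NBLOCKS];   /* volatile write cache */
    bool dirty[NBLOCKS];  /* cached value not yet on the media */
};

void toy_init(struct toy_disk *d)
{
    memset(d, 0, sizeof(*d));
}

/* Like a flush request: push every cached write to the media. */
void toy_flush(struct toy_disk *d)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (d->dirty[i]) {
            d->media[i] = d->cache[i];
            d->dirty[i] = false;
        }
    }
}

/* preflush: flush the cache first; fua: write this block through. */
void toy_write(struct toy_disk *d, int blk, int val, bool preflush, bool fua)
{
    if (preflush)
        toy_flush(d);
    d->cache[blk] = val;
    if (fua) {
        d->media[blk] = val;
        d->dirty[blk] = false;
    } else {
        d->dirty[blk] = true;
    }
}

/* Power loss: whatever was only in the cache is gone. */
void toy_power_loss(struct toy_disk *d)
{
    memset(d->cache, 0, sizeof(d->cache));
    memset(d->dirty, 0, sizeof(d->dirty));
}
```

A plain write followed by toy_power_loss() leaves the media untouched, while a write with both preflush and fua set (the shape of the jbd2 commit block write) leaves the earlier cached blocks and the block itself on the media.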

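A digression for beginners

From a programmer's point of view, handling data (files) means being aware of caches such as the ones below. In most cases the OS and libraries take care of them, so you (should) rarely have to think about it.

fflush(3) and friends: the cache in userland
sync(2) and friends: the buffer cache in the kernel block layer
ioprio_get(2) and friends: these affect the queue that decides the order in which the disk driver handles requests; also keep an eye on the I/O scheduler (/sys/block/sda/queue/scheduler and so on)
the cache inside the storage device itself, which is the topic here

Similar examples are the CPU's L1/L2/L3 caches, and on some SoCs caches on buses other than DRAM. All of these tend to be called just "cache", so you always have to keep in mind which cache is meant.

Modern CPUs run briskly at GHz rates, but to use that effectively data must be fed in continuously, so caches sit everywhere to keep that supply steady. Caches break data consistency, so extra work is needed to preserve it, which makes things tedious.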

Definitions

They are defined in kernel/include/linux/blk_types.h. The names with OP are request types (exactly one can be chosen), and the names without OP are flags (several bits can be set on one request).

blk_types.h

145 enum req_opf {
146         /* read sectors from the device */
147         REQ_OP_READ             = 0,
148         /* write sectors to the device */
149         REQ_OP_WRITE            = 1,
150         /* flush the volatile write cache */
151         REQ_OP_FLUSH            = 2,
152         /* discard sectors */
153         REQ_OP_DISCARD          = 3,
154         /* get zone information */
155         REQ_OP_ZONE_REPORT      = 4,
156         /* securely erase sectors */
157         REQ_OP_SECURE_ERASE     = 5,
158         /* seset a zone write pointer */
159         REQ_OP_ZONE_RESET       = 6,
160         /* write the same sector many times */
161         REQ_OP_WRITE_SAME       = 7,
162         /* write the zero filled sector many times */
163         REQ_OP_WRITE_ZEROES     = 8,
164 
165         REQ_OP_LAST,
166 };
167 
168 enum req_flag_bits {
169         __REQ_FAILFAST_DEV =    /* no driver retries of device errors */
170                 REQ_OP_BITS,
171         __REQ_FAILFAST_TRANSPORT, /* no driver retries of transport errors */
172         __REQ_FAILFAST_DRIVER,  /* no driver retries of driver errors */
173         __REQ_SYNC,             /* request is sync (sync write or read) */
174         __REQ_META,             /* metadata io request */
175         __REQ_PRIO,             /* boost priority in cfq */
176         __REQ_NOMERGE,          /* don't touch this for merging */
177         __REQ_IDLE,             /* anticipate more IO after this one */
178         __REQ_INTEGRITY,        /* I/O includes block integrity payload */
179         __REQ_FUA,              /* forced unit access */
180         __REQ_PREFLUSH,         /* request for cache flush */
181         __REQ_RAHEAD,           /* read ahead, can fail anytime */
182         __REQ_BACKGROUND,       /* background IO */
183         __REQ_NR_BITS,          /* stops here */
184 };
185 
186 #define REQ_FAILFAST_DEV        (1ULL << __REQ_FAILFAST_DEV)
187 #define REQ_FAILFAST_TRANSPORT  (1ULL << __REQ_FAILFAST_TRANSPORT)
188 #define REQ_FAILFAST_DRIVER     (1ULL << __REQ_FAILFAST_DRIVER)
189 #define REQ_SYNC                (1ULL << __REQ_SYNC)
190 #define REQ_META                (1ULL << __REQ_META)
191 #define REQ_PRIO                (1ULL << __REQ_PRIO)
192 #define REQ_NOMERGE             (1ULL << __REQ_NOMERGE)
193 #define REQ_IDLE                (1ULL << __REQ_IDLE)
194 #define REQ_INTEGRITY           (1ULL << __REQ_INTEGRITY)
195 #define REQ_FUA                 (1ULL << __REQ_FUA)
196 #define REQ_PREFLUSH            (1ULL << __REQ_PREFLUSH)
197 #define REQ_RAHEAD              (1ULL << __REQ_RAHEAD)
198 #define REQ_BACKGROUND          (1ULL << __REQ_BACKGROUND)
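The split between one operation and several flags can be sketched as follows. The SK_ names are local stand-ins, and the value 8 for the number of op bits is an assumption on my part (the excerpt above only shows that the flag bits start at REQ_OP_BITS); the bit offsets of the three flags follow their positions in the enum.

```c
/* Assumed width of the op field; the flag bits start right above it. */
#define SK_REQ_OP_BITS 8
#define SK_REQ_OP_MASK ((1u << SK_REQ_OP_BITS) - 1)

enum { SK_REQ_OP_READ = 0, SK_REQ_OP_WRITE = 1, SK_REQ_OP_FLUSH = 2 };

/* Offsets counted from the first flag in enum req_flag_bits. */
enum {
    SK_BIT_SYNC     = SK_REQ_OP_BITS + 3,
    SK_BIT_FUA      = SK_REQ_OP_BITS + 9,
    SK_BIT_PREFLUSH = SK_REQ_OP_BITS + 10,
};

#define SK_REQ_SYNC     (1u << SK_BIT_SYNC)
#define SK_REQ_FUA      (1u << SK_BIT_FUA)
#define SK_REQ_PREFLUSH (1u << SK_BIT_PREFLUSH)

/* Combine exactly one op with any number of flags. */
unsigned int sk_make_opf(unsigned int op, unsigned int flags)
{
    return (op & SK_REQ_OP_MASK) | flags;
}

/* Extract the op again. */
unsigned int sk_op(unsigned int opf)
{
    return opf & SK_REQ_OP_MASK;
}

/* Test whether a flag is set. */
int sk_has_flag(unsigned int opf, unsigned int flag)
{
    return (opf & flag) != 0;
}
```

For example, sk_make_opf(SK_REQ_OP_WRITE, SK_REQ_SYNC | SK_REQ_PREFLUSH | SK_REQ_FUA) mirrors the argument pair passed to submit_bh() for the commit block.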

Incidentally, REQ_BACKGROUND is apparently part of the headline feature of Linux 4.10 (improved writeback).

Documentation

kernel/Documentation/block/writeback_cache_control.txt has documentation that looks relevant.

REQ_FUA

For scsi (sd)

For a scsi disk (sd), the FUA bit is set in the WRITE_32 and WRITE_16 commands.
According to the SCSI specification (PDF):

WRITE: If FUA = 1, all data must be written to the media before the SCSI operation returns the status and completion message bytes
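As a sketch of what that looks like on the wire, the function below fills in a WRITE(16) command block, assuming the bit in question is the FUA bit, 0x08 in byte 1. The function name and the SK_ constants are made up, and the protection, DPO, group, and control fields are simply left at zero.

```c
#include <stdint.h>
#include <string.h>

#define SK_WRITE_16 0x8a  /* WRITE(16) operation code */
#define SK_CDB_FUA  0x08  /* FUA bit in byte 1 */

/* Fill a 16-byte WRITE(16) CDB; the LBA and length are big-endian. */
void build_write16(uint8_t cdb[16], uint64_t lba, uint32_t nblocks, int fua)
{
    memset(cdb, 0, 16);
    cdb[0] = SK_WRITE_16;
    cdb[1] = fua ? SK_CDB_FUA : 0;
    for (int i = 0; i < 8; i++)
        cdb[2 + i] = (uint8_t)(lba >> (8 * (7 - i)));
    for (int i = 0; i < 4; i++)
        cdb[10 + i] = (uint8_t)(nblocks >> (8 * (3 - i)));
}
```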

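For eMMC

A reliable write is performed. According to JEDEC, a reliable write is specified as a write that is not held in the cache. http://www.jedec.org/sites/default/files/docs/JESD84-B451.pdf is visible to members only. ...JEDEC specifications are sold, so I thought putting them online for free was not allowed, and yet some sites turn up with an ordinary web search.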

REQ_PREFLUSH

It means: before handling this request, do the same thing as if REQ_OP_FLUSH had been requested.

REQ_OP_FLUSH

For scsi (sd)

SYNCHRONIZE_CACHE (the SYNCHRONIZE CACHE (10) command) is sent.

For eMMC

1 is written to EXT_CSD_FLUSH_CACHE (index 32).

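Afterword

Leaving journal_async_commit aside: with nobarrier on RAID, LVM, or anything else that spreads data over several storage devices, the cache is split across devices and the timing becomes unpredictable, so carelessly adding nobarrier probably makes the system fragile against power loss. For important data it cannot be used without a UPS or similar. Conversely, on a single storage device the risk is probably comparatively small, though it depends on the firmware inside.

In either case everything depends on the firmware: if it lies about FUA or cache commands (reports "done" while actually doing nothing), what ext4/jbd2 does above becomes meaningless. The opposite can happen too: as with the "stuttering SSDs" (puchi-furi) that were common about five years ago, firmware whose handling has not been tuned can make performance plummet.

I have now covered most of what relates to writing to disk, so this is planned to be the last of my ext4 articles. If you are interested, please see the earlier articles too.

...Oh, LogFS was removed in Linux 4.10...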
