Tracing what happens when you write to drop_caches on Linux

Posted at 2017-08-18

Introduction

There are plenty of articles that write to drop_caches and then observe /proc/meminfo or the output of free(1) before and after, but I couldn't find any that explain in detail what actually happens when you write to drop_caches, so I decided to dig into it myself.

...Or rather, this article is (probably) the record of a certain engineer who made that mistake, sank into a quagmire, and burned a precious day off in the process.

Note: I'm looking at roughly Linux-4.12 and procps-ng-3.3.12.

Overview of the page cache

Overview

To begin with, under normal circumstances there should be no situation where you actually need to write a value to drop_caches. If I had to name one, maybe when you want to benchmark something both with and without the data in the page cache?

Occasionally someone sees a small MemFree in /proc/meminfo and concludes "we're out of memory, we have to free the cache", but in most cases that is a misunderstanding. Perhaps because the MemFree shown in /proc/meminfo was itself considered misleading, MemAvailable was added in Linux-3.14. If you don't care about performance and simply want to know whether there is enough memory, looking at MemAvailable is generally good enough.

It was even backported to RHEL 6.6, and the free command was improved as well (since procps-ng-3.3.10); everyone seems to love MemAvailable. Well, it is easy to understand, after all.

...Wait, this isn't an overview of the page cache at all...

The actual overview

On Linux, no data can be processed until it has been loaded into memory (RAM) at least once: files being read, files being written, the programs themselves. So, to handle data represented as files efficiently in RAM, there is the page cache mechanism.

Put simply, every file that has been read or written at least once gets cached in RAM in units of pages (usually 4KB on x86_64). That way the next access doesn't have to go back to disk, which improves performance. And how much does it cache? It keeps caching until it has eaten up nearly all of the free memory. So if the system runs for a long time or reads lots of files, free memory keeps shrinking.

When free memory runs out, you would be in trouble the moment memory is really needed. What actually happens is that, whenever memory is needed, page cache that doesn't seem to be in use (found via the LRU lists as "not recently used") is released to recover free memory. Of course, doing this literally on every allocation would add needless overhead, so when reclaim does run it frees somewhat more than strictly required, and the kswapd kernel thread proactively reclaims memory when it decides things are getting tight.

To maximize performance you want to cache as much as possible, but if you cache too much, reclaim has to run exactly when free memory is needed and the system can't respond immediately. This tug-of-war is why the page cache handling is multi-layered and complicated.

Note that some database systems manage memory themselves even for file data, because going through the kernel's page cache can actually hurt performance. The typical mechanism for this is open(2) with O_DIRECT; that feature is essentially a statement of "we don't trust the Linux kernel", and the open(2) man page contains Linus's grumbling about it.

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances." --- Linus

Incidentally, Linus Torvalds is (in)famous for his rants; a quick search turns up collections like this one (BrainyQuotes), which might serve as a nice bit of escapism before a dreary afternoon meeting.
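
Coming back to O_DIRECT for a moment: here is a minimal, hedged sketch of what using it looks like (my own toy, not from the article). Part of what makes the interface awkward is that the buffer, the file offset, and the I/O size generally all have to be aligned to the block size.

/* odirect_read.c -- hedged sketch: read 4 KiB from a file, bypassing the page cache.
 * O_DIRECT generally requires the buffer/offset/length to be block-aligned. */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) {  /* an aligned buffer is mandatory */
        close(fd);
        return 1;
    }

    ssize_t n = read(fd, buf, 4096);         /* goes around the page cache */
    printf("read %zd bytes without the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}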

...Having written all that, it is still not really an overview of the page cache. Oh well...

Documentation

drop_caches

From kernel/Documentation/sysctl/vm.txt:

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes.  Once dropped, their
memory becomes free.

To free pagecache:
    echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
    echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
    echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync' prior to writing to /proc/sys/vm/drop_caches.  This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc...)  These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems.  Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use.  Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used:

    cat (1234): drop_caches: 3

These are informational only.  They do not mean that anything is wrong
with your system.  To disable them, echo 4 (bit 3) into drop_caches.
  • It is not "destructive" (there is no irreversible state change)
  • If you want dirty objects (inodes and the like) dealt with as well, run sync(1) before writing to drop_caches
  • It can hurt performance, so writing to it outside of testing or debugging is not recommended
  • Writing a value with the extra bit (4) set suppresses the "somebody did this" log message

Those are the points worth noting. Hmm, wait -- if we count from zero, shouldn't that be bit 2?

Incidentally, as pointed out in the comments of 続 @ITのmeminfoの見方の説明が完全に間違っている件について (革命の日々 その2), it is the act of writing that matters; the value remembered in drop_caches carries no state whatsoever.
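
As a concrete illustration, here is a minimal sketch (error handling kept short, needs root) of doing the same thing from C instead of the usual echo one-liner: sync() first so dirty pages become clean candidates, then write the string "3" to /proc/sys/vm/drop_caches.

/* drop3.c -- hedged sketch, equivalent to: sync; echo 3 > /proc/sys/vm/drop_caches */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();                       /* flush dirty data so more pages become droppable */

    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, "3", 1) != 1) { /* 1 = pagecache, 2 = slab, 3 = both */
        perror("write");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}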

MemAvailable

While we're at it, MemAvailable too. From kernel/Documentation/filesystems/proc.txt:

MemAvailable: An estimate of how much memory is available for starting new
              applications, without swapping. Calculated from MemFree,
              SReclaimable, the size of the file LRU lists, and the low
              watermarks in each zone.
              The estimate takes into account that the system needs some
              page cache to function well, and that not all reclaimable
              slab will be reclaimable, due to items being in use. The
              impact of those factors will vary from system to system.

It reads like a long list of excuses, but ultimately it reflects the reality of a rather non-deterministic implementation: how much memory user processes can actually get cannot be known until you actually try to allocate it.
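
Just for reference, here is a minimal, hedged sketch of reading MemAvailable from userspace -- essentially what free(1) does, minus all the bookkeeping (the parsing below is deliberately naive):

/* meminfo_avail.c -- hedged sketch: print MemAvailable (kB) from /proc/meminfo. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    char line[256];
    unsigned long long kb = 0;
    while (fgets(line, sizeof(line), fp)) {
        /* Lines look like: "MemAvailable:    1234567 kB" */
        if (sscanf(line, "MemAvailable: %llu kB", &kb) == 1) {
            printf("MemAvailable: %llu kB\n", kb);
            break;
        }
    }
    fclose(fp);
    return 0;
}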

Reading the kernel source (drop_caches)

The drop_caches interface

From ctl_table vm_table[] in kernel/kernel/sysctl.c:

sysctl.c
    {
        .procname   = "drop_caches",
        .data       = &sysctl_drop_caches,
        .maxlen     = sizeof(int),
        .mode       = 0644,
        .proc_handler   = drop_caches_sysctl_handler,
        .extra1     = &one,
        .extra2     = &four,
    },

drop_caches_sysctl_handler() is in kernel/fs/drop_caches.c:

drop_caches.c
int drop_caches_sysctl_handler(struct ctl_table *table, int write,
    void __user *buffer, size_t *length, loff_t *ppos)
{
    int ret;

    ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
    if (ret)
        return ret;
    if (write) {
        static int stfu;

        if (sysctl_drop_caches & 1) {
            iterate_supers(drop_pagecache_sb, NULL);
            count_vm_event(DROP_PAGECACHE);
        }
        if (sysctl_drop_caches & 2) {
            drop_slab();
            count_vm_event(DROP_SLAB);
        }
        if (!stfu) {
            pr_info("%s (%d): drop_caches: %d\n",
                current->comm, task_pid_nr(current),
                sysctl_drop_caches);
        }
        stfu |= sysctl_drop_caches & 4;
    }
    return 0;
}

Since proc_dointvec_minmax() is used, the earlier .extra1 (== one) is the minimum and .extra2 (== four) is the maximum accepted value. So:

  • writing 1 frees the page cache
  • writing 2 frees reclaimable slab objects
  • writing 3 frees both the page cache and slab objects
  • writing a value with 4 set suppresses the log output (KERN_INFO) from then on

Also, the global variable sysctl_drop_caches is not referenced anywhere else, which backs up the point that "the value remembered in drop_caches carries no state whatsoever".

The article Linuxでキャッシュを追い出す方法 - Qiita contains the line

2. Dirty caches and inodes

but I can't find any page or document that says anything of the sort. What could you possibly have read to write such a mistake... Could it be that "dentry" was misread as "dirty entry (== inode)"? dentry is short for directory entry.

Freeing the page cache (drop_pagecache_sb())

Continuing from drop_caches_sysctl_handler(). iterate_supers() is in kernel/fs/super.c:

super.c
/**
 *  iterate_supers - call function for all active superblocks
 *  @f: function to call
 *  @arg: argument to pass to it
 *
 *  Scans the superblock list and calls given function, passing it
 *  locked superblock and given argument.
 */
void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
{
    struct super_block *sb, *p = NULL;

    spin_lock(&sb_lock);
    list_for_each_entry(sb, &super_blocks, s_list) {
        if (hlist_unhashed(&sb->s_instances))
            continue;
        sb->s_count++;
        spin_unlock(&sb_lock);

        down_read(&sb->s_umount);
        if (sb->s_root && (sb->s_flags & MS_BORN))
            f(sb, arg);
        up_read(&sb->s_umount);

        spin_lock(&sb_lock);
        if (p)
            __put_super(p);
        p = sb;
    }
    if (p)
        __put_super(p);
    spin_unlock(&sb_lock);
}

It iterates straightforwardly while taking care of locking. So the net effect is that drop_pagecache_sb() is called once per mounted filesystem. drop_pagecache_sb() is in kernel/fs/drop_caches.c:

drop_caches.c
static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
    struct inode *inode, *toput_inode = NULL;

    spin_lock(&sb->s_inode_list_lock);
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
        spin_lock(&inode->i_lock);
        if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
            (inode->i_mapping->nrpages == 0)) {
            spin_unlock(&inode->i_lock);
            continue;
        }
        __iget(inode);
        spin_unlock(&inode->i_lock);
        spin_unlock(&sb->s_inode_list_lock);

        invalidate_mapping_pages(inode->i_mapping, 0, -1);
        iput(toput_inode);
        toput_inode = inode;

        spin_lock(&sb->s_inode_list_lock);
    }
    spin_unlock(&sb->s_inode_list_lock);
    iput(toput_inode);
}

Here too, while juggling various locks, it calls invalidate_mapping_pages() on the i_mapping of each inode in the filesystem. The range runs from 0 to -1 (as an unsigned long, i.e. pgoff_t), so the entire mapping is covered. From kernel/mm/truncate.c:

truncate.c
/**
 * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
 * @mapping: the address_space which holds the pages to invalidate
 * @start: the offset 'from' which to invalidate
 * @end: the offset 'to' which to invalidate (inclusive)
 *
 * This function only removes the unlocked pages, if you want to
 * remove all the pages of one inode, you must call truncate_inode_pages.
 *
 * invalidate_mapping_pages() will not block on IO activity. It will not
 * invalidate pages which are dirty, locked, under writeback or mapped into
 * pagetables.
 */
unsigned long invalidate_mapping_pages(struct address_space *mapping,
        pgoff_t start, pgoff_t end)
{
    pgoff_t indices[PAGEVEC_SIZE];
    struct pagevec pvec;
    pgoff_t index = start;
    unsigned long ret;
    unsigned long count = 0;
    int i;

    pagevec_init(&pvec, 0);
    while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
            min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
            indices)) {
        for (i = 0; i < pagevec_count(&pvec); i++) {
            struct page *page = pvec.pages[i];

            /* We rely upon deletion not changing page->index */
            index = indices[i];
            if (index > end)
                break;

            if (radix_tree_exceptional_entry(page)) {
                invalidate_exceptional_entry(mapping, index,
                                 page);
                continue;
            }

            if (!trylock_page(page))
                continue;

            WARN_ON(page_to_index(page) != index);

            /* Middle of THP: skip */
            if (PageTransTail(page)) {
                unlock_page(page);
                continue;
            } else if (PageTransHuge(page)) {
                index += HPAGE_PMD_NR - 1;
                i += HPAGE_PMD_NR - 1;
                /* 'end' is in the middle of THP */
                if (index ==  round_down(end, HPAGE_PMD_NR))
                    continue;
            }

            ret = invalidate_inode_page(page);
            unlock_page(page);
            /*
             * Invalidation is a hint that the page is no longer
             * of interest and try to speed up its reclaim.
             */
            if (!ret)
                deactivate_file_page(page);
            count += ret;
        }
        pagevec_remove_exceptionals(&pvec);
        pagevec_release(&pvec);
        cond_resched();
        index++;
    }
    return count;
}

Now we have wandered into mm. For functions like this, the best thing is to read the comments carefully.

  • It does not block on I/O activity
  • It does not reclaim the following kinds of pages:
    • dirty, locked, under writeback, or mapped into page tables (i.e. in use by a user process)

The page state checks and the way it walks the pagevec are messy, but in the end the core of it appears to be the calls to invalidate_inode_page() and deactivate_file_page(). invalidate_inode_page() is in kernel/mm/truncate.c:

truncate.c
/*
 * Safely invalidate one page from its pagecache mapping.
 * It only drops clean, unused pages. The page must be locked.
 *
 * Returns 1 if the page is successfully invalidated, otherwise 0.
 */
int invalidate_inode_page(struct page *page)
{
    struct address_space *mapping = page_mapping(page);
    if (!mapping)
        return 0;
    if (PageDirty(page) || PageWriteback(page))
        return 0;
    if (page_mapped(page))
        return 0;
    return invalidate_complete_page(mapping, page);
}
truncate.c
/*
 * This is for invalidate_mapping_pages().  That function can be called at
 * any time, and is not supposed to throw away dirty pages.  But pages can
 * be marked dirty at any time too, so use remove_mapping which safely
 * discards clean, unused pages.
 *
 * Returns non-zero if the page was successfully invalidated.
 */
static int
invalidate_complete_page(struct address_space *mapping, struct page *page)
{
    int ret;

    if (page->mapping != mapping)
        return 0;

    if (page_has_private(page) && !try_to_release_page(page, 0))
        return 0;

    ret = remove_mapping(mapping, page);

    return ret;
}

remove_mapping() goes on to call __remove_mapping(), __delete_from_page_cache() and so on, but that leads into the fine details of page management, so let's stop here. deactivate_file_page() is in kernel/mm/swap.c:

swap.c
/**
 * deactivate_file_page - forcefully deactivate a file page
 * @page: page to deactivate
 *
 * This function hints the VM that @page is a good reclaim candidate,
 * for example if its invalidation fails due to the page being dirty
 * or under writeback.
 */
void deactivate_file_page(struct page *page)
{
    /*
     * In a workload with many unevictable page such as mprotect,
     * unevictable page deactivation for accelerating reclaim is pointless.
     */
    if (PageUnevictable(page))
        return;

    if (likely(get_page_unless_zero(page))) {
        struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);

        if (!pagevec_add(pvec, page) || PageCompound(page))
            pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
        put_cpu_var(lru_deactivate_file_pvecs);
    }
}

Here too, whether a page is really safe to release is checked carefully. lru_deactivate_file_fn() performs further checks, releasable pages are chained onto pvec (== lru_deactivate_file_pvecs), and they appear to be released later in batches, asynchronously (around lru_add_drain_work()?). This also leads into the fine details, so I'll stop here as well.

Freeing slab objects (drop_slab())

Back in drop_caches_sysctl_handler(), now for drop_slab(). From kernel/mm/vmscan.c:

vmscan.c
void drop_slab_node(int nid)
{
    unsigned long freed;

    do {
        struct mem_cgroup *memcg = NULL;

        freed = 0;
        do {
            freed += shrink_slab(GFP_KERNEL, nid, memcg,
                         1000, 1000);
        } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
    } while (freed > 10);
}

void drop_slab(void)
{
    int nid;

    for_each_online_node(nid)
        drop_slab_node(nid);
}

We have landed straight in mm. Suppressing the urge to give up, for_each_online_node() is in kernel/include/linux/nodemask.h:

nodemask.h
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
nodemask.h
#define for_each_node_state(__node, __state) \
    for_each_node_mask((__node), node_states[__state])
nodemask.h
#if MAX_NUMNODES > 1
#define for_each_node_mask(node, mask)          \
    for ((node) = first_node(mask);         \
        (node) < MAX_NUMNODES;          \
        (node) = next_node((node), (mask)))
#else /* MAX_NUMNODES == 1 */
#define for_each_node_mask(node, mask)          \
    if (!nodes_empty(mask))             \
        for ((node) = 0; (node) < 1; (node)++)
#endif /* MAX_NUMNODES */

In the end it just iterates over each NODE. I wasn't entirely sure what a NODE is in the first place; it looks related to CONFIG_NUMA and CONFIG_MEMCG. If you simply want to follow the logic, reading the code paths assuming a single node with CONFIG_NUMA and CONFIG_MEMCG disabled should be fine.

(Addendum: as clarified in the comments, this NODE is indeed a node in the CONFIG_NUMA sense. For CONFIG_NUMA itself, see the relevant Kconfig for now...)

Back in drop_slab_node(), what it ultimately calls is shrink_slab(). From kernel/mm/vmscan.c:

vmscan.c
/**
 * shrink_slab - shrink slab caches
 * @gfp_mask: allocation context
 * @nid: node whose slab caches to target
 * @memcg: memory cgroup whose slab caches to target
 * @nr_scanned: pressure numerator
 * @nr_eligible: pressure denominator
 *
 * Call the shrink functions to age shrinkable caches.
 *
 * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
 * unaware shrinkers will receive a node id of 0 instead.
 *
 * @memcg specifies the memory cgroup to target. If it is not NULL,
 * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan
 * objects from the memory cgroup specified. Otherwise, only unaware
 * shrinkers are called.
 *
 * @nr_scanned and @nr_eligible form a ratio that indicate how much of
 * the available objects should be scanned.  Page reclaim for example
 * passes the number of pages scanned and the number of pages on the
 * LRU lists that it considered on @nid, plus a bias in @nr_scanned
 * when it encountered mapped pages.  The ratio is further biased by
 * the ->seeks setting of the shrink function, which indicates the
 * cost to recreate an object relative to that of an LRU page.
 *
 * Returns the number of reclaimed slab objects.
 */
static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                 struct mem_cgroup *memcg,
                 unsigned long nr_scanned,
                 unsigned long nr_eligible)
{
    struct shrinker *shrinker;
    unsigned long freed = 0;

    if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
        return 0;

    if (nr_scanned == 0)
        nr_scanned = SWAP_CLUSTER_MAX;

    if (!down_read_trylock(&shrinker_rwsem)) {
        /*
         * If we would return 0, our callers would understand that we
         * have nothing else to shrink and give up trying. By returning
         * 1 we keep it going and assume we'll be able to shrink next
         * time.
         */
        freed = 1;
        goto out;
    }

    list_for_each_entry(shrinker, &shrinker_list, list) {
        struct shrink_control sc = {
            .gfp_mask = gfp_mask,
            .nid = nid,
            .memcg = memcg,
        };

        /*
         * If kernel memory accounting is disabled, we ignore
         * SHRINKER_MEMCG_AWARE flag and call all shrinkers
         * passing NULL for memcg.
         */
        if (memcg_kmem_enabled() &&
            !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
            continue;

        if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
            sc.nid = 0;

        freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
    }

    up_read(&shrinker_rwsem);
out:
    cond_resched();
    return freed;
}

This is a rather nasty function, so it's worth understanding the comment here too, although the comment does not actually go into much depth. Essentially, it iterates over the registered shrinkers -- the modules registered to free memory allocated from the slab allocator (the kernel's internal memory-management mechanism)... probably. For each shrinker it then calls do_shrink_slab(). From kernel/mm/vmscan.c:

vmscan.c
#define SHRINK_BATCH 128

static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
                    struct shrinker *shrinker,
                    unsigned long nr_scanned,
                    unsigned long nr_eligible)
{
    unsigned long freed = 0;
    unsigned long long delta;
    long total_scan;
    long freeable;
    long nr;
    long new_nr;
    int nid = shrinkctl->nid;
    long batch_size = shrinker->batch ? shrinker->batch
                      : SHRINK_BATCH;
    long scanned = 0, next_deferred;

    freeable = shrinker->count_objects(shrinker, shrinkctl);
    if (freeable == 0)
        return 0;

    /*
     * copy the current shrinker scan count into a local variable
     * and zero it so that other concurrent shrinker invocations
     * don't also do this scanning work.
     */
    nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);

    total_scan = nr;
    delta = (4 * nr_scanned) / shrinker->seeks;
    delta *= freeable;
    do_div(delta, nr_eligible + 1);
    total_scan += delta;
    if (total_scan < 0) {
        pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
               shrinker->scan_objects, total_scan);
        total_scan = freeable;
        next_deferred = nr;
    } else
        next_deferred = total_scan;

    /*
     * We need to avoid excessive windup on filesystem shrinkers
     * due to large numbers of GFP_NOFS allocations causing the
     * shrinkers to return -1 all the time. This results in a large
     * nr being built up so when a shrink that can do some work
     * comes along it empties the entire cache due to nr >>>
     * freeable. This is bad for sustaining a working set in
     * memory.
     *
     * Hence only allow the shrinker to scan the entire cache when
     * a large delta change is calculated directly.
     */
    if (delta < freeable / 4)
        total_scan = min(total_scan, freeable / 2);

    /*
     * Avoid risking looping forever due to too large nr value:
     * never try to free more than twice the estimate number of
     * freeable entries.
     */
    if (total_scan > freeable * 2)
        total_scan = freeable * 2;

    trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
                   nr_scanned, nr_eligible,
                   freeable, delta, total_scan);

    /*
     * Normally, we should not scan less than batch_size objects in one
     * pass to avoid too frequent shrinker calls, but if the slab has less
     * than batch_size objects in total and we are really tight on memory,
     * we will try to reclaim all available objects, otherwise we can end
     * up failing allocations although there are plenty of reclaimable
     * objects spread over several slabs with usage less than the
     * batch_size.
     *
     * We detect the "tight on memory" situations by looking at the total
     * number of objects we want to scan (total_scan). If it is greater
     * than the total number of objects on slab (freeable), we must be
     * scanning at high prio and therefore should try to reclaim as much as
     * possible.
     */
    while (total_scan >= batch_size ||
           total_scan >= freeable) {
        unsigned long ret;
        unsigned long nr_to_scan = min(batch_size, total_scan);

        shrinkctl->nr_to_scan = nr_to_scan;
        ret = shrinker->scan_objects(shrinker, shrinkctl);
        if (ret == SHRINK_STOP)
            break;
        freed += ret;

        count_vm_events(SLABS_SCANNED, nr_to_scan);
        total_scan -= nr_to_scan;
        scanned += nr_to_scan;

        cond_resched();
    }

    if (next_deferred >= scanned)
        next_deferred -= scanned;
    else
        next_deferred = 0;
    /*
     * move the unused scan count back into the shrinker in a
     * manner that handles concurrent updates. If we exhausted the
     * scan, there is no need to do an update.
     */
    if (next_deferred > 0)
        new_nr = atomic_long_add_return(next_deferred,
                        &shrinker->nr_deferred[nid]);
    else
        new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);

    trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
    return freed;
}

You can see it adjusting how much to free per call and how often it gets called, but honestly the details are hard to follow. There are tracepoints, so ftrace should make it fairly easy to observe the shrinkers being called in turn. Note also that because down_read_trylock() is used, the shrinkers may simply not be called at all if there happens to be lock contention at that moment.
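
To see the shape of the interface from the other side, here is a hedged sketch (not from the article) of what registering a shrinker of one's own might look like on a kernel around 4.12. register_shrinker(), unregister_shrinker(), count_objects()/scan_objects() and SHRINK_STOP are real kernel API; toy_cached is a made-up counter standing in for a real cache.

/* toy_shrinker.c -- hedged sketch of a shrinker, loosely modeled on the
 * super_block shrinker shown above. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t toy_cached = ATOMIC_LONG_INIT(128); /* pretend cache size */

static unsigned long toy_count(struct shrinker *s, struct shrink_control *sc)
{
    /* How many objects could we free right now? 0 means "nothing to do". */
    return atomic_long_read(&toy_cached);
}

static unsigned long toy_scan(struct shrinker *s, struct shrink_control *sc)
{
    unsigned long nr = min_t(unsigned long, sc->nr_to_scan,
                             atomic_long_read(&toy_cached));
    if (!nr)
        return SHRINK_STOP;           /* tell vmscan to stop calling us */
    atomic_long_sub(nr, &toy_cached); /* pretend to free nr objects */
    return nr;                        /* number of objects actually freed */
}

static struct shrinker toy_shrinker = {
    .count_objects = toy_count,
    .scan_objects  = toy_scan,
    .seeks         = DEFAULT_SEEKS,
};

static int __init toy_init(void)
{
    return register_shrinker(&toy_shrinker);
}

static void __exit toy_exit(void)
{
    unregister_shrinker(&toy_shrinker);
}

module_init(toy_init);
module_exit(toy_exit);
MODULE_LICENSE("GPL");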

Reading the kernel source (a concrete shrinker: inodes and dentries)

Allocating dentries

shrink_slab() calls every registered shrinker, not just the ones for dentries and inodes. Still, that is rather hard to get a grip on, so let's follow the dentry shrinker as a concrete example. First, the place where dentries are allocated. From kernel/fs/dcache.c:

dcache.c
/**
 * d_alloc  -   allocate a dcache entry
 * @parent: parent of entry to allocate
 * @name: qstr of the name
 *
 * Allocates a dentry. It returns %NULL if there is insufficient memory
 * available. On a success the dentry is returned. The name passed in is
 * copied and the copy passed in may be reused after this call.
 */
struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
{
    struct dentry *dentry = __d_alloc(parent->d_sb, name);
    if (!dentry)
        return NULL;
    dentry->d_flags |= DCACHE_RCUACCESS;
    spin_lock(&parent->d_lock);
    /*
     * don't need child lock because it is not subject
     * to concurrency here
     */
    __dget_dlock(parent);
    dentry->d_parent = parent;
    list_add(&dentry->d_child, &parent->d_subdirs);
    spin_unlock(&parent->d_lock);

    return dentry;
}
EXPORT_SYMBOL(d_alloc);

This function is called, and a dentry created, each time a lookup or similar operation is done on a new name. There are, however, similar functions, and dentries can be created through paths other than the one above; to trace this properly you may also want to look at the places that call __d_alloc() directly.

Skipping the details, inodes are allocated around alloc_inode() in kernel/fs/inode.c.

Registering the shrinker

The corresponding shrinker is registered in sget_userns(). From kernel/fs/super.c:

super.c
/**
 *  sget_userns -   find or create a superblock
 *  @type:  filesystem type superblock should belong to
 *  @test:  comparison callback
 *  @set:   setup callback
 *  @flags: mount flags
 *  @user_ns: User namespace for the super_block
 *  @data:  argument to each of them
 */
struct super_block *sget_userns(struct file_system_type *type,
            int (*test)(struct super_block *,void *),
            int (*set)(struct super_block *,void *),
            int flags, struct user_namespace *user_ns,
            void *data)
{
    struct super_block *s = NULL;
    struct super_block *old;
    int err;

    if (!(flags & (MS_KERNMOUNT|MS_SUBMOUNT)) &&
        !(type->fs_flags & FS_USERNS_MOUNT) &&
        !capable(CAP_SYS_ADMIN))
        return ERR_PTR(-EPERM);
retry:
    spin_lock(&sb_lock);
    if (test) {
        hlist_for_each_entry(old, &type->fs_supers, s_instances) {
            if (!test(old, data))
                continue;
            if (user_ns != old->s_user_ns) {
                spin_unlock(&sb_lock);
                if (s) {
                    up_write(&s->s_umount);
                    destroy_super(s);
                }
                return ERR_PTR(-EBUSY);
            }
            if (!grab_super(old))
                goto retry;
            if (s) {
                up_write(&s->s_umount);
                destroy_super(s);
                s = NULL;
            }
            return old;
        }
    }
    if (!s) {
        spin_unlock(&sb_lock);
        s = alloc_super(type, (flags & ~MS_SUBMOUNT), user_ns);
        if (!s)
            return ERR_PTR(-ENOMEM);
        goto retry;
    }

    err = set(s, data);
    if (err) {
        spin_unlock(&sb_lock);
        up_write(&s->s_umount);
        destroy_super(s);
        return ERR_PTR(err);
    }
    s->s_type = type;
    strlcpy(s->s_id, type->name, sizeof(s->s_id));
    list_add_tail(&s->s_list, &super_blocks);
    hlist_add_head(&s->s_instances, &type->fs_supers);
    spin_unlock(&sb_lock);
    get_filesystem(type);
    register_shrinker(&s->s_shrink);
    return s;
}

It registers the shrinker by calling register_shrinker(). The s_shrink registered there is set up in alloc_super() in kernel/fs/super.c:

super.c
/**
 *  alloc_super -   create new superblock
 *  @type:  filesystem type superblock should belong to
 *  @flags: the mount flags
 *  @user_ns: User namespace for the super_block
 *
 *  Allocates and initializes a new &struct super_block.  alloc_super()
 *  returns a pointer new superblock or %NULL if allocation had failed.
 */
static struct super_block *alloc_super(struct file_system_type *type, int flags,
                       struct user_namespace *user_ns)
{
    struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
    static const struct super_operations default_op;
    int i;

    if (!s)
        return NULL;

    INIT_LIST_HEAD(&s->s_mounts);
    s->s_user_ns = get_user_ns(user_ns);

    if (security_sb_alloc(s))
        goto fail;

    for (i = 0; i < SB_FREEZE_LEVELS; i++) {
        if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
                    sb_writers_name[i],
                    &type->s_writers_key[i]))
            goto fail;
    }
    init_waitqueue_head(&s->s_writers.wait_unfrozen);
    s->s_bdi = &noop_backing_dev_info;
    s->s_flags = flags;
    if (s->s_user_ns != &init_user_ns)
        s->s_iflags |= SB_I_NODEV;
    INIT_HLIST_NODE(&s->s_instances);
    INIT_HLIST_BL_HEAD(&s->s_anon);
    mutex_init(&s->s_sync_lock);
    INIT_LIST_HEAD(&s->s_inodes);
    spin_lock_init(&s->s_inode_list_lock);
    INIT_LIST_HEAD(&s->s_inodes_wb);
    spin_lock_init(&s->s_inode_wblist_lock);

    if (list_lru_init_memcg(&s->s_dentry_lru))
        goto fail;
    if (list_lru_init_memcg(&s->s_inode_lru))
        goto fail;

    init_rwsem(&s->s_umount);
    lockdep_set_class(&s->s_umount, &type->s_umount_key);
    /*
     * sget() can have s_umount recursion.
     *
     * When it cannot find a suitable sb, it allocates a new
     * one (this one), and tries again to find a suitable old
     * one.
     *
     * In case that succeeds, it will acquire the s_umount
     * lock of the old one. Since these are clearly distrinct
     * locks, and this object isn't exposed yet, there's no
     * risk of deadlocks.
     *
     * Annotate this by putting this lock in a different
     * subclass.
     */
    down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING);
    s->s_count = 1;
    atomic_set(&s->s_active, 1);
    mutex_init(&s->s_vfs_rename_mutex);
    lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
    mutex_init(&s->s_dquot.dqio_mutex);
    s->s_maxbytes = MAX_NON_LFS;
    s->s_op = &default_op;
    s->s_time_gran = 1000000000;
    s->cleancache_poolid = CLEANCACHE_NO_POOL;

    s->s_shrink.seeks = DEFAULT_SEEKS;
    s->s_shrink.scan_objects = super_cache_scan;
    s->s_shrink.count_objects = super_cache_count;
    s->s_shrink.batch = 1024;
    s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
    return s;

fail:
    destroy_super(s);
    return NULL;
}

It sets super_cache_scan() and super_cache_count(). super_cache_scan(), which runs when objects are actually being freed, is in kernel/fs/super.c:

super.c
/*
 * One thing we have to be careful of with a per-sb shrinker is that we don't
 * drop the last active reference to the superblock from within the shrinker.
 * If that happens we could trigger unregistering the shrinker from within the
 * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
 * take a passive reference to the superblock to avoid this from occurring.
 */
static unsigned long super_cache_scan(struct shrinker *shrink,
                      struct shrink_control *sc)
{
    struct super_block *sb;
    long    fs_objects = 0;
    long    total_objects;
    long    freed = 0;
    long    dentries;
    long    inodes;

    sb = container_of(shrink, struct super_block, s_shrink);

    /*
     * Deadlock avoidance.  We may hold various FS locks, and we don't want
     * to recurse into the FS that called us in clear_inode() and friends..
     */
    if (!(sc->gfp_mask & __GFP_FS))
        return SHRINK_STOP;

    if (!trylock_super(sb))
        return SHRINK_STOP;

    if (sb->s_op->nr_cached_objects)
        fs_objects = sb->s_op->nr_cached_objects(sb, sc);

    inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
    dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
    total_objects = dentries + inodes + fs_objects + 1;
    if (!total_objects)
        total_objects = 1;

    /* proportion the scan between the caches */
    dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
    inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
    fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);

    /*
     * prune the dcache first as the icache is pinned by it, then
     * prune the icache, followed by the filesystem specific caches
     *
     * Ensure that we always scan at least one object - memcg kmem
     * accounting uses this to fully empty the caches.
     */
    sc->nr_to_scan = dentries + 1;
    freed = prune_dcache_sb(sb, sc);
    sc->nr_to_scan = inodes + 1;
    freed += prune_icache_sb(sb, sc);

    if (fs_objects) {
        sc->nr_to_scan = fs_objects + 1;
        freed += sb->s_op->free_cached_objects(sb, sc);
    }

    up_read(&sb->s_umount);
    return freed;
}

prune_icache_sb() frees inodes and prune_dcache_sb() frees dentries.
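
To make the proportional split concrete, here is a hedged, userspace-only re-run of the arithmetic with made-up numbers (mult_frac(a, b, c) is essentially a*b/c, just written to avoid intermediate overflow):

/* proportion.c -- hedged illustration of super_cache_scan()'s scan split.
 * All numbers are hypothetical. */
#include <stdio.h>

int main(void)
{
    long nr_to_scan = 1024;                  /* batch handed in by do_shrink_slab() */
    long dentries = 600, inodes = 300, fs_objects = 0;
    long total = dentries + inodes + fs_objects + 1;     /* 901 */

    long scan_dentries = nr_to_scan * dentries / total;  /* 681 */
    long scan_inodes   = nr_to_scan * inodes   / total;  /* 340 */

    /* prune_dcache_sb() then gets nr_to_scan = 682, prune_icache_sb() gets 341 */
    printf("dcache: %ld, icache: %ld\n", scan_dentries + 1, scan_inodes + 1);
    return 0;
}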

Freeing inodes from the shrinker (prune_icache_sb())

prune_icache_sb() is in kernel/fs/inode.c:

inode.c
/*
 * Walk the superblock inode LRU for freeable inodes and attempt to free them.
 * This is called from the superblock shrinker function with a number of inodes
 * to trim from the LRU. Inodes to be freed are moved to a temporary list and
 * then are freed outside inode_lock by dispose_list().
 */
long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
{
    LIST_HEAD(freeable);
    long freed;

    freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
                     inode_lru_isolate, &freeable);
    dispose_list(&freeable);
    return freed;
}

Abbreviating a little: it walks the superblock's s_inode_lru, calls inode_lru_isolate() for each inode to check whether it can be freed, pushes the freeable ones onto the freeable list, and then frees them in one go with dispose_list(). inode_lru_isolate() is:

inode.c
/*
 * Isolate the inode from the LRU in preparation for freeing it.
 *
 * Any inodes which are pinned purely because of attached pagecache have their
 * pagecache removed.  If the inode has metadata buffers attached to
 * mapping->private_list then try to remove them.
 *
 * If the inode has the I_REFERENCED flag set, then it means that it has been
 * used recently - the flag is set in iput_final(). When we encounter such an
 * inode, clear the flag and move it to the back of the LRU so it gets another
 * pass through the LRU before it gets reclaimed. This is necessary because of
 * the fact we are doing lazy LRU updates to minimise lock contention so the
 * LRU does not have strict ordering. Hence we don't want to reclaim inodes
 * with this flag set because they are the inodes that are out of order.
 */
static enum lru_status inode_lru_isolate(struct list_head *item,
        struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
{
    struct list_head *freeable = arg;
    struct inode    *inode = container_of(item, struct inode, i_lru);

    /*
     * we are inverting the lru lock/inode->i_lock here, so use a trylock.
     * If we fail to get the lock, just skip it.
     */
    if (!spin_trylock(&inode->i_lock))
        return LRU_SKIP;

    /*
     * Referenced or dirty inodes are still in use. Give them another pass
     * through the LRU as we canot reclaim them now.
     */
    if (atomic_read(&inode->i_count) ||
        (inode->i_state & ~I_REFERENCED)) {
        list_lru_isolate(lru, &inode->i_lru);
        spin_unlock(&inode->i_lock);
        this_cpu_dec(nr_unused);
        return LRU_REMOVED;
    }

    /* recently referenced inodes get one more pass */
    if (inode->i_state & I_REFERENCED) {
        inode->i_state &= ~I_REFERENCED;
        spin_unlock(&inode->i_lock);
        return LRU_ROTATE;
    }

    if (inode_has_buffers(inode) || inode->i_data.nrpages) {
        __iget(inode);
        spin_unlock(&inode->i_lock);
        spin_unlock(lru_lock);
        if (remove_inode_buffers(inode)) {
            unsigned long reap;
            reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
            if (current_is_kswapd())
                __count_vm_events(KSWAPD_INODESTEAL, reap);
            else
                __count_vm_events(PGINODESTEAL, reap);
            if (current->reclaim_state)
                current->reclaim_state->reclaimed_slab += reap;
        }
        iput(inode);
        spin_lock(lru_lock);
        return LRU_RETRY;
    }

    WARN_ON(inode->i_state & I_NEW);
    inode->i_state |= I_FREEING;
    list_lru_isolate_move(lru, &inode->i_lru, freeable);
    spin_unlock(&inode->i_lock);

    this_cpu_dec(nr_unused);
    return LRU_REMOVED;
}

Only after carefully checking things like whether the inode was used recently and whether any i_mapping pages remain does it decide the inode can be freed and link it onto freeable. dispose_list() is:

inode.c
/*
 * dispose_list - dispose of the contents of a local list
 * @head: the head of the list to free
 *
 * Dispose-list gets a local list with local inodes in it, so it doesn't
 * need to worry about list corruption and SMP locks.
 */
static void dispose_list(struct list_head *head)
{
    while (!list_empty(head)) {
        struct inode *inode;

        inode = list_first_entry(head, struct inode, i_lru);
        list_del_init(&inode->i_lru);

        evict(inode);
        cond_resched();
    }
}

Nothing tricky here; it is a straightforward loop. evict() is where the inode actually gets released.

inode.c
/*
 * Free the inode passed in, removing it from the lists it is still connected
 * to. We remove any pages still attached to the inode and wait for any IO that
 * is still in progress before finally destroying the inode.
 *
 * An inode must already be marked I_FREEING so that we avoid the inode being
 * moved back onto lists if we race with other code that manipulates the lists
 * (e.g. writeback_single_inode). The caller is responsible for setting this.
 *
 * An inode must already be removed from the LRU list before being evicted from
 * the cache. This should occur atomically with setting the I_FREEING state
 * flag, so no inodes here should ever be on the LRU when being evicted.
 */
static void evict(struct inode *inode)
{
    const struct super_operations *op = inode->i_sb->s_op;

    BUG_ON(!(inode->i_state & I_FREEING));
    BUG_ON(!list_empty(&inode->i_lru));

    if (!list_empty(&inode->i_io_list))
        inode_io_list_del(inode);

    inode_sb_list_del(inode);

    /*
     * Wait for flusher thread to be done with the inode so that filesystem
     * does not start destroying it while writeback is still running. Since
     * the inode has I_FREEING set, flusher thread won't start new work on
     * the inode.  We just have to wait for running writeback to finish.
     */
    inode_wait_for_writeback(inode);

    if (op->evict_inode) {
        op->evict_inode(inode);
    } else {
        truncate_inode_pages_final(&inode->i_data);
        clear_inode(inode);
    }
    if (S_ISBLK(inode->i_mode) && inode->i_bdev)
        bd_forget(inode);
    if (S_ISCHR(inode->i_mode) && inode->i_cdev)
        cd_forget(inode);

    remove_inode_hash(inode);

    spin_lock(&inode->i_lock);
    wake_up_bit(&inode->i_state, __I_NEW);
    BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
    spin_unlock(&inode->i_lock);

    destroy_inode(inode);
}

You might expect it to do very little, but it actually does quite a lot, including calling the filesystem-specific op->evict_inode. For ext4 this is ext4_evict_inode(), from kernel/fs/ext4/inode.c:

inode.c
/*
 * Called at the last iput() if i_nlink is zero.
 */
void ext4_evict_inode(struct inode *inode)
{
    handle_t *handle;
    int err;

    trace_ext4_evict_inode(inode);

    if (inode->i_nlink) {
        /*
         * When journalling data dirty buffers are tracked only in the
         * journal. So although mm thinks everything is clean and
         * ready for reaping the inode might still have some pages to
         * write in the running transaction or waiting to be
         * checkpointed. Thus calling jbd2_journal_invalidatepage()
         * (via truncate_inode_pages()) to discard these buffers can
         * cause data loss. Also even if we did not discard these
         * buffers, we would have no way to find them after the inode
         * is reaped and thus user could see stale data if he tries to
         * read them before the transaction is checkpointed. So be
         * careful and force everything to disk here... We use
         * ei->i_datasync_tid to store the newest transaction
         * containing inode's data.
         *
         * Note that directories do not have this problem because they
         * don't use page cache.
         */
        if (inode->i_ino != EXT4_JOURNAL_INO &&
            ext4_should_journal_data(inode) &&
            (S_ISLNK(inode->i_mode) || S_ISREG(inode->i_mode))) {
            journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
            tid_t commit_tid = EXT4_I(inode)->i_datasync_tid;

            jbd2_complete_transaction(journal, commit_tid);
            filemap_write_and_wait(&inode->i_data);
        }
        truncate_inode_pages_final(&inode->i_data);

        goto no_delete;
    }

    if (is_bad_inode(inode))
        goto no_delete;
    dquot_initialize(inode);

    if (ext4_should_order_data(inode))
        ext4_begin_ordered_truncate(inode, 0);
    truncate_inode_pages_final(&inode->i_data);

    /*
     * Protect us against freezing - iput() caller didn't have to have any
     * protection against it
     */
    sb_start_intwrite(inode->i_sb);
    handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
                    ext4_blocks_for_truncate(inode)+3);
    if (IS_ERR(handle)) {
        ext4_std_error(inode->i_sb, PTR_ERR(handle));
        /*
         * If we're going to skip the normal cleanup, we still need to
         * make sure that the in-core orphan linked list is properly
         * cleaned up.
         */
        ext4_orphan_del(NULL, inode);
        sb_end_intwrite(inode->i_sb);
        goto no_delete;
    }

    if (IS_SYNC(inode))
        ext4_handle_sync(handle);
    inode->i_size = 0;
    err = ext4_mark_inode_dirty(handle, inode);
    if (err) {
        ext4_warning(inode->i_sb,
                 "couldn't mark inode dirty (err %d)", err);
        goto stop_handle;
    }
    if (inode->i_blocks) {
        err = ext4_truncate(inode);
        if (err) {
            ext4_error(inode->i_sb,
                   "couldn't truncate inode %lu (err %d)",
                   inode->i_ino, err);
            goto stop_handle;
        }
    }

    /*
     * ext4_ext_truncate() doesn't reserve any slop when it
     * restarts journal transactions; therefore there may not be
     * enough credits left in the handle to remove the inode from
     * the orphan list and set the dtime field.
     */
    if (!ext4_handle_has_enough_credits(handle, 3)) {
        err = ext4_journal_extend(handle, 3);
        if (err > 0)
            err = ext4_journal_restart(handle, 3);
        if (err != 0) {
            ext4_warning(inode->i_sb,
                     "couldn't extend journal (err %d)", err);
        stop_handle:
            ext4_journal_stop(handle);
            ext4_orphan_del(NULL, inode);
            sb_end_intwrite(inode->i_sb);
            goto no_delete;
        }
    }

    /*
     * Kill off the orphan record which ext4_truncate created.
     * AKPM: I think this can be inside the above `if'.
     * Note that ext4_orphan_del() has to be able to cope with the
     * deletion of a non-existent orphan - this is because we don't
     * know if ext4_truncate() actually created an orphan record.
     * (Well, we could do this if we need to, but heck - it works)
     */
    ext4_orphan_del(handle, inode);
    EXT4_I(inode)->i_dtime  = get_seconds();

    /*
     * One subtle ordering requirement: if anything has gone wrong
     * (transaction abort, IO errors, whatever), then we can still
     * do these next steps (the fs will already have been marked as
     * having errors), but we can't free the inode if the mark_dirty
     * fails.
     */
    if (ext4_mark_inode_dirty(handle, inode))
        /* If that failed, just do the required in-core inode clear. */
        ext4_clear_inode(inode);
    else
        ext4_free_inode(handle, inode);
    ext4_journal_stop(handle);
    sb_end_intwrite(inode->i_sb);
    return;
no_delete:
    ext4_clear_inode(inode);    /* We must guarantee clearing of inode... */
}

It clearly does quite a lot; let's give up on understanding it and pretend we never saw it. Yes.

Freeing dentries from the shrinker (prune_dcache_sb())

prune_dcache_sb() is in kernel/fs/dcache.c:

dcache.c
/**
 * prune_dcache_sb - shrink the dcache
 * @sb: superblock
 * @sc: shrink control, passed to list_lru_shrink_walk()
 *
 * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This
 * is done when we need more memory and called from the superblock shrinker
 * function.
 *
 * This function may fail to free any resources if all the dentries are in
 * use.
 */
long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
{
    LIST_HEAD(dispose);
    long freed;

    freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
                     dentry_lru_isolate, &dispose);
    shrink_dentry_list(&dispose);
    return freed;
}

This is similar to prune_icache_sb() above: it walks s_dentry_lru with list_lru_shrink_walk(), decides with dentry_lru_isolate() whether each dentry can be freed and links it onto the dispose list, then frees them in one go with shrink_dentry_list(). dentry_lru_isolate() is:

dcache.c
static enum lru_status dentry_lru_isolate(struct list_head *item,
        struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
{
    struct list_head *freeable = arg;
    struct dentry   *dentry = container_of(item, struct dentry, d_lru);


    /*
     * we are inverting the lru lock/dentry->d_lock here,
     * so use a trylock. If we fail to get the lock, just skip
     * it
     */
    if (!spin_trylock(&dentry->d_lock))
        return LRU_SKIP;

    /*
     * Referenced dentries are still in use. If they have active
     * counts, just remove them from the LRU. Otherwise give them
     * another pass through the LRU.
     */
    if (dentry->d_lockref.count) {
        d_lru_isolate(lru, dentry);
        spin_unlock(&dentry->d_lock);
        return LRU_REMOVED;
    }

    if (dentry->d_flags & DCACHE_REFERENCED) {
        dentry->d_flags &= ~DCACHE_REFERENCED;
        spin_unlock(&dentry->d_lock);

        /*
         * The list move itself will be made by the common LRU code. At
         * this point, we've dropped the dentry->d_lock but keep the
         * lru lock. This is safe to do, since every list movement is
         * protected by the lru lock even if both locks are held.
         *
         * This is guaranteed by the fact that all LRU management
         * functions are intermediated by the LRU API calls like
         * list_lru_add and list_lru_del. List movement in this file
         * only ever occur through this functions or through callbacks
         * like this one, that are called from the LRU API.
         *
         * The only exceptions to this are functions like
         * shrink_dentry_list, and code that first checks for the
         * DCACHE_SHRINK_LIST flag.  Those are guaranteed to be
         * operating only with stack provided lists after they are
         * properly isolated from the main list.  It is thus, always a
         * local access.
         */
        return LRU_ROTATE;
    }

    d_lru_shrink_move(lru, dentry, freeable);
    spin_unlock(&dentry->d_lock);

    return LRU_REMOVED;
}

After carefully checking usage and lock state, it links the dentry onto the freeable list with d_lru_shrink_move(). shrink_dentry_list() is:

dcache.c
static void shrink_dentry_list(struct list_head *list)
{
    struct dentry *dentry, *parent;

    while (!list_empty(list)) {
        struct inode *inode;
        dentry = list_entry(list->prev, struct dentry, d_lru);
        spin_lock(&dentry->d_lock);
        parent = lock_parent(dentry);

        /*
         * The dispose list is isolated and dentries are not accounted
         * to the LRU here, so we can simply remove it from the list
         * here regardless of whether it is referenced or not.
         */
        d_shrink_del(dentry);

        /*
         * We found an inuse dentry which was not removed from
         * the LRU because of laziness during lookup. Do not free it.
         */
        if (dentry->d_lockref.count > 0) {
            spin_unlock(&dentry->d_lock);
            if (parent)
                spin_unlock(&parent->d_lock);
            continue;
        }


        if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
            bool can_free = dentry->d_flags & DCACHE_MAY_FREE;
            spin_unlock(&dentry->d_lock);
            if (parent)
                spin_unlock(&parent->d_lock);
            if (can_free)
                dentry_free(dentry);
            continue;
        }

        inode = dentry->d_inode;
        if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
            d_shrink_add(dentry, list);
            spin_unlock(&dentry->d_lock);
            if (parent)
                spin_unlock(&parent->d_lock);
            continue;
        }

        __dentry_kill(dentry);

        /*
         * We need to prune ancestors too. This is necessary to prevent
         * quadratic behavior of shrink_dcache_parent(), but is also
         * expected to be beneficial in reducing dentry cache
         * fragmentation.
         */
        dentry = parent;
        while (dentry && !lockref_put_or_lock(&dentry->d_lockref)) {
            parent = lock_parent(dentry);
            if (dentry->d_lockref.count != 1) {
                dentry->d_lockref.count--;
                spin_unlock(&dentry->d_lock);
                if (parent)
                    spin_unlock(&parent->d_lock);
                break;
            }
            inode = dentry->d_inode;    /* can't be NULL */
            if (unlikely(!spin_trylock(&inode->i_lock))) {
                spin_unlock(&dentry->d_lock);
                if (parent)
                    spin_unlock(&parent->d_lock);
                cpu_relax();
                continue;
            }
            __dentry_kill(dentry);
            dentry = parent;
        }
    }
}

There seem to be some delicate intermediate states to handle, and the parent dentries need looking after as well; honestly I only half understand it. The actual freeing is done by __dentry_kill().

dcache.c
static void __dentry_kill(struct dentry *dentry)
{
    struct dentry *parent = NULL;
    bool can_free = true;
    if (!IS_ROOT(dentry))
        parent = dentry->d_parent;

    /*
     * The dentry is now unrecoverably dead to the world.
     */
    lockref_mark_dead(&dentry->d_lockref);

    /*
     * inform the fs via d_prune that this dentry is about to be
     * unhashed and destroyed.
     */
    if (dentry->d_flags & DCACHE_OP_PRUNE)
        dentry->d_op->d_prune(dentry);

    if (dentry->d_flags & DCACHE_LRU_LIST) {
        if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
            d_lru_del(dentry);
    }
    /* if it was on the hash then remove it */
    __d_drop(dentry);
    dentry_unlist(dentry, parent);
    if (parent)
        spin_unlock(&parent->d_lock);
    if (dentry->d_inode)
        dentry_unlink_inode(dentry);
    else
        spin_unlock(&dentry->d_lock);
    this_cpu_dec(nr_dentry);
    if (dentry->d_op && dentry->d_op->d_release)
        dentry->d_op->d_release(dentry);

    spin_lock(&dentry->d_lock);
    if (dentry->d_flags & DCACHE_SHRINK_LIST) {
        dentry->d_flags |= DCACHE_MAY_FREE;
        can_free = false;
    }
    spin_unlock(&dentry->d_lock);
    if (likely(can_free))
        dentry_free(dentry);
}

As with evict_inode(), there is a mechanism to call the filesystem-specific d_op->d_prune. Fortunately(?), hardly any filesystems use it, and ext4 doesn't either. Incidentally, from kernel/fs/dcache.c:

dcache.c
void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
{
    WARN_ON_ONCE(dentry->d_op);
    WARN_ON_ONCE(dentry->d_flags & (DCACHE_OP_HASH  |
                DCACHE_OP_COMPARE   |
                DCACHE_OP_REVALIDATE    |
                DCACHE_OP_WEAK_REVALIDATE   |
                DCACHE_OP_DELETE    |
                DCACHE_OP_REAL));
    dentry->d_op = op;
    if (!op)
        return;
    if (op->d_hash)
        dentry->d_flags |= DCACHE_OP_HASH;
    if (op->d_compare)
        dentry->d_flags |= DCACHE_OP_COMPARE;
    if (op->d_revalidate)
        dentry->d_flags |= DCACHE_OP_REVALIDATE;
    if (op->d_weak_revalidate)
        dentry->d_flags |= DCACHE_OP_WEAK_REVALIDATE;
    if (op->d_delete)
        dentry->d_flags |= DCACHE_OP_DELETE;
    if (op->d_prune)
        dentry->d_flags |= DCACHE_OP_PRUNE;
    if (op->d_real)
        dentry->d_flags |= DCACHE_OP_REAL;

}

So there is machinery in place to hook in various filesystem-specific entry points.

Manipulating the inode LRU list

We have confirmed that when the shrinker calls prune_icache_sb(), the inodes on the LRU list are scanned in turn. So when do inodes get onto that LRU list in the first place? From kernel/fs/inode.c:

inode.c
static void inode_lru_list_add(struct inode *inode)
{
    if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
        this_cpu_inc(nr_unused);
    else
        inode->i_state |= I_REFERENCED;
}

/*
 * Add inode to LRU if needed (inode is unused and clean).
 *
 * Needs inode->i_lock held.
 */
void inode_add_lru(struct inode *inode)
{
    if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
                I_FREEING | I_WILL_FREE)) &&
        !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)
        inode_lru_list_add(inode);
}

They are added when inode_add_lru() is called, and there are two call sites. The first is iput_final() in kernel/fs/inode.c:

inode.c
/*
 * Called when we're dropping the last reference
 * to an inode.
 *
 * Call the FS "drop_inode()" function, defaulting to
 * the legacy UNIX filesystem behaviour.  If it tells
 * us to evict inode, do so.  Otherwise, retain inode
 * in cache if fs is alive, sync and evict if fs is
 * shutting down.
 */
static void iput_final(struct inode *inode)
{
    struct super_block *sb = inode->i_sb;
    const struct super_operations *op = inode->i_sb->s_op;
    int drop;

    WARN_ON(inode->i_state & I_NEW);

    if (op->drop_inode)
        drop = op->drop_inode(inode);
    else
        drop = generic_drop_inode(inode);

    if (!drop && (sb->s_flags & MS_ACTIVE)) {
        inode_add_lru(inode);
        spin_unlock(&inode->i_lock);
        return;
    }

    if (!drop) {
        inode->i_state |= I_WILL_FREE;
        spin_unlock(&inode->i_lock);
        write_inode_now(inode, 1);
        spin_lock(&inode->i_lock);
        WARN_ON(inode->i_state & I_NEW);
        inode->i_state &= ~I_WILL_FREE;
    }

    inode->i_state |= I_FREEING;
    if (!list_empty(&inode->i_lru))
        inode_lru_list_del(inode);
    spin_unlock(&inode->i_lock);

    evict(inode);
}

iput_final() is called when the last reference to an inode goes away (the reference count drops from 1 to 0). Judging from the comment, in most cases calling iput_final() ends up in the path that calls inode_add_lru().

The second is inode_sync_complete() in kernel/fs/fs-writeback.c:

fs-writeback.c
static void inode_sync_complete(struct inode *inode)
{
    inode->i_state &= ~I_SYNC;
    /* If inode is clean an unused, put it into LRU now... */
    inode_add_lru(inode);
    /* Waiters must see I_SYNC cleared before being woken up */
    smp_mb();
    wake_up_bit(&inode->i_state, __I_SYNC);
}

Contrary to its comment, it calls inode_add_lru() unconditionally even when the inode is not unused, but presumably the i_count check inside inode_add_lru() saves us in that case... right?

So the flow is: inodes whose inode->i_count has dropped to zero go onto the LRU list and get freed when the shrinker asks for memory back.

Manipulating the dentry LRU list

As with inodes, when d_lru_add() in kernel/fs/dcache.c is called, the dentry is added to sb->s_dentry_lru and becomes a target for the shrinker.

dcache.c
/*
 * The DCACHE_LRU_LIST bit is set whenever the 'd_lru' entry
 * is in use - which includes both the "real" per-superblock
 * LRU list _and_ the DCACHE_SHRINK_LIST use.
 *
 * The DCACHE_SHRINK_LIST bit is set whenever the dentry is
 * on the shrink list (ie not on the superblock LRU list).
 *
 * The per-cpu "nr_dentry_unused" counters are updated with
 * the DCACHE_LRU_LIST bit.
 *
 * These helper functions make sure we always follow the
 * rules. d_lock must be held by the caller.
 */
#define D_FLAG_VERIFY(dentry,x) WARN_ON_ONCE(((dentry)->d_flags & (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))
static void d_lru_add(struct dentry *dentry)
{
    D_FLAG_VERIFY(dentry, 0);
    dentry->d_flags |= DCACHE_LRU_LIST;
    this_cpu_inc(nr_dentry_unused);
    WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}
dcache.c
/*
 * dentry_lru_(add|del)_list) must be called with d_lock held.
 */
static void dentry_lru_add(struct dentry *dentry)
{
    if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
        d_lru_add(dentry);
    else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED)))
        dentry->d_flags |= DCACHE_REFERENCED;
}
dcache.c
/* 
 * This is dput
 *
 * This is complicated by the fact that we do not want to put
 * dentries that are no longer on any hash chain on the unused
 * list: we'd much rather just get rid of them immediately.
 *
 * However, that implies that we have to traverse the dentry
 * tree upwards to the parents which might _also_ now be
 * scheduled for deletion (it may have been only waiting for
 * its last child to go away).
 *
 * This tail recursion is done by hand as we don't want to depend
 * on the compiler to always get this right (gcc generally doesn't).
 * Real recursion would eat up our stack space.
 */

/*
 * dput - release a dentry
 * @dentry: dentry to release 
 *
 * Release a dentry. This will drop the usage count and if appropriate
 * call the dentry unlink method as well as removing it from the queues and
 * releasing its resources. If the parent dentries were scheduled for release
 * they too may now get deleted.
 */
void dput(struct dentry *dentry)
{
    if (unlikely(!dentry))
        return;

repeat:
    might_sleep();

    rcu_read_lock();
    if (likely(fast_dput(dentry))) {
        rcu_read_unlock();
        return;
    }

    /* Slow case: now with the dentry lock held */
    rcu_read_unlock();

    WARN_ON(d_in_lookup(dentry));

    /* Unreachable? Get rid of it */
    if (unlikely(d_unhashed(dentry)))
        goto kill_it;

    if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED))
        goto kill_it;

    if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
        if (dentry->d_op->d_delete(dentry))
            goto kill_it;
    }

    dentry_lru_add(dentry);

    dentry->d_lockref.count--;
    spin_unlock(&dentry->d_lock);
    return;

kill_it:
    dentry = dentry_kill(dentry);
    if (dentry) {
        cond_resched();
        goto repeat;
    }
}

It is only ever reached through the dput() -> dentry_lru_add() -> d_lru_add() chain. fast_dput() is also quite hard to follow, and it is not easy to see when dentry_lru_add() actually gets called -- but as it happens, the documentation spells it out. From kernel/Documentation/filesystems/vfs.txt:

vfs.txt
  dput: close a handle for a dentry (decrements the usage count). If
    the usage count drops to 0, and the dentry is still in its
    parent's hash, the "d_delete" method is called to check whether
    it should be cached. If it should not be cached, or if the dentry
    is not hashed, it is deleted. Otherwise cached dentries are put
    into an LRU list to be reclaimed on memory shortage.

So the condition is: the reference count has dropped to zero, and op->d_delete() returned 0 (i.e. the dentry is no longer referenced but should stay cached). In other words, dentries whose references are gone get put on the LRU list and become candidates for freeing when the shrinker comes asking for memory.

Reading the procps-ng source

While we're at it, let's also look at the source of free(1), which went as far as implementing its own MemAvailable-compatible display. The "available" column printed by free comes from procps/free.c:

free.c
        meminfo();
        /* Translation Hint: You can use 9 character words in
         * the header, and the words need to be right align to
         * beginning of a number. */
        if (flags & FREE_WIDE) {
            printf(_("              total        used        free      shared     buffers       cache   available"));
        } else {
            printf(_("              total        used        free      shared  buff/cache   available"));
        }
        printf("\n");
        printf("%-7s", _("Mem:"));
        printf(" %11s", scale_size(kb_main_total, flags, args));
        printf(" %11s", scale_size(kb_main_used, flags, args));
        printf(" %11s", scale_size(kb_main_free, flags, args));
        printf(" %11s", scale_size(kb_main_shared, flags, args));
        if (flags & FREE_WIDE) {
            printf(" %11s", scale_size(kb_main_buffers, flags, args));
            printf(" %11s", scale_size(kb_main_cached, flags, args));
        } else {
            printf(" %11s", scale_size(kb_main_buffers+kb_main_cached, flags, args));
        }
        printf(" %11s", scale_size(kb_main_available, flags, args));
        printf("\n");

It prints the global variable kb_main_available, which comes from mem_table[] in procps/proc/sysinfo.c:

sysinfo.c
  {"MemAvailable", &kb_main_available}, // important

No idea what is so "important" about it, but the value parsed out of /proc/meminfo is stored there. There is also a fallback for kernels that do not expose MemAvailable; from procps/proc/sysinfo.c:

sysinfo.c
  /* zero? might need fallback for 2.6.27 <= kernel <? 3.14 */
  if (!kb_main_available) {
    if (linux_version_code < LINUX_VERSION(2, 6, 27))
      kb_main_available = kb_main_free;
    else {
      FILE_TO_BUF(VM_MIN_FREE_FILE, vm_min_free_fd);
      kb_min_free = (unsigned long) strtoull(buf,&tail,10);

      watermark_low = kb_min_free * 5 / 4; /* should be equal to sum of all 'low' fields in /proc/zoneinfo */

      mem_available = (signed long)kb_main_free - watermark_low
      + kb_inactive_file + kb_active_file - MIN((kb_inactive_file + kb_active_file) / 2, watermark_low)
      + kb_slab_reclaimable - MIN(kb_slab_reclaimable / 2, watermark_low);

      if (mem_available < 0) mem_available = 0;
      kb_main_available = (unsigned long)mem_available;
    }
  }

which is quite a cryptic formula. It was not made up out of thin air, though; from si_mem_available() in kernel/mm/page_alloc.c:

page_alloc.c
long si_mem_available(void)
{
    long available;
    unsigned long pagecache;
    unsigned long wmark_low = 0;
    unsigned long pages[NR_LRU_LISTS];
    struct zone *zone;
    int lru;

    for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
        pages[lru] = global_node_page_state(NR_LRU_BASE + lru);

    for_each_zone(zone)
        wmark_low += zone->watermark[WMARK_LOW];

    /*
     * Estimate the amount of memory available for userspace allocations,
     * without causing swapping.
     */
    available = global_page_state(NR_FREE_PAGES) - totalreserve_pages;

    /*
     * Not all the page cache can be freed, otherwise the system will
     * start swapping. Assume at least half of the page cache, or the
     * low watermark worth of cache, needs to stay.
     */
    pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
    pagecache -= min(pagecache / 2, wmark_low);
    available += pagecache;

    /*
     * Part of the reclaimable slab consists of items that are in use,
     * and cannot be freed. Cap this estimate at the low watermark.
     */
    available += global_page_state(NR_SLAB_RECLAIMABLE) -
             min(global_page_state(NR_SLAB_RECLAIMABLE) / 2, wmark_low);

    if (available < 0)
        available = 0;
    return available;
}

Apart from free approximating watermark_low on its own, the calculation is essentially the same as the kernel's, so it is clear that free's implementation was lifted from... let's say modeled on the kernel.

...Incidentally, how should "watermark" be translated into Japanese so the nuance survives? Dictionaries are not much help here. Is something like "a gauge mark for checking the water level" close enough?
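To make the fallback concrete, here is a minimal standalone sketch (my own illustration, not procps code; the file name memavail_fallback.c is made up) that reads /proc/meminfo and /proc/sys/vm/min_free_kbytes and applies the same formula as the sysinfo.c fallback quoted above:

memavail_fallback.c
/*
 * A toy that mimics the procps fallback:
 *   watermark_low  = min_free_kbytes * 5 / 4
 *   MemAvailable  ~= MemFree - watermark_low
 *                  + file LRU     - min(file LRU / 2,     watermark_low)
 *                  + SReclaimable - min(SReclaimable / 2, watermark_low)
 */
#include <stdio.h>
#include <string.h>

/* Return one /proc/meminfo field in kB, or -1 if it is missing. */
static long meminfo_kb(const char *key)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    long val = -1;
    size_t klen = strlen(key);

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            sscanf(line + klen + 1, "%ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(void)
{
    long memfree    = meminfo_kb("MemFree");
    long act_file   = meminfo_kb("Active(file)");
    long inact_file = meminfo_kb("Inactive(file)");
    long sreclaim   = meminfo_kb("SReclaimable");
    long min_free = 0, wmark_low, filecache, avail;
    FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "r");

    if (f) {
        if (fscanf(f, "%ld", &min_free) != 1)
            min_free = 0;
        fclose(f);
    }
    if (memfree < 0 || act_file < 0 || inact_file < 0 || sreclaim < 0) {
        fprintf(stderr, "could not read /proc/meminfo\n");
        return 1;
    }

    /* procps approximates the sum of all zones' "low" watermarks
     * as min_free_kbytes * 5 / 4. */
    wmark_low = min_free * 5 / 4;

    filecache = act_file + inact_file;
    avail = memfree - wmark_low
          + filecache - (filecache / 2 < wmark_low ? filecache / 2 : wmark_low)
          + sreclaim  - (sreclaim  / 2 < wmark_low ? sreclaim  / 2 : wmark_low);
    if (avail < 0)
        avail = 0;

    printf("estimated MemAvailable: %ld kB\n", avail);
    return 0;
}

On a kernel that already exposes MemAvailable, the printed estimate can be compared directly against the value in /proc/meminfo.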

Miscellaneous page cache topics

You could also just call this a memo to my future self.

madvise

madvise(2) accepts MADV_DONTNEED and MADV_FREE, among others.

  MADV_DONTNEED
          Do not expect access in the near future.  (For the time being,
          the application is finished with the given range, so the
          kernel can free resources associated with it.)

          After a successful MADV_DONTNEED operation, the semantics of
          memory access in the specified region are changed: subsequent
          accesses of pages in the range will succeed, but will result
          in either repopulating the memory contents from the up-to-date
          contents of the underlying mapped file (for shared file
          mappings, shared anonymous mappings, and shmem-based
          techniques such as System V shared memory segments) or zero-
          fill-on-demand pages for anonymous private mappings.

          Note that, when applied to shared mappings, MADV_DONTNEED
          might not lead to immediate freeing of the pages in the range.
          The kernel is free to delay freeing the pages until an
          appropriate moment.  The resident set size (RSS) of the
          calling process will be immediately reduced however.

          MADV_DONTNEED cannot be applied to locked pages, Huge TLB
          pages, or VM_PFNMAP pages.  (Pages marked with the kernel-
          internal VM_PFNMAP flag are special memory areas that are not
          managed by the virtual memory subsystem.  Such pages are
          typically created by device drivers that map the pages into
          user space.)

  MADV_FREE (since Linux 4.5)
          The application no longer requires the pages in the range
          specified by addr and len.  The kernel can thus free these
          pages, but the freeing could be delayed until memory pressure
          occurs.  For each of the pages that has been marked to be
          freed but has not yet been freed, the free operation will be
          canceled if the caller writes into the page.  After a
          successful MADV_FREE operation, any stale data (i.e., dirty,
          unwritten pages) will be lost when the kernel frees the pages.
          However, subsequent writes to pages in the range will succeed
          and then kernel cannot free those dirtied pages, so that the
          caller can always see just written data.  If there is no
          subsequent write, the kernel can free the pages at any time.
          Once pages in the range have been freed, the caller will see
          zero-fill-on-demand pages upon subsequent page references.

          The MADV_FREE operation can be applied only to private
          anonymous pages (see mmap(2)).  On a swapless system, freeing
          pages in a given range happens instantly, regardless of memory
          pressure.

From a memory standpoint, MADV_DONTNEED releases the pages immediately, whereas MADV_FREE may not release them until memory pressure (i.e. the shrinker) arrives. If they have not been released yet, the contents you wrote earlier are still there when you touch the memory again; if they have been released, touching it triggers a fresh allocation, and for anonymous memory (MADV_FREE only works on anonymous pages) you get zero-filled pages. ...Which is rather confusing behavior.
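As a quick illustration of that behavior, here is a minimal sketch of my own (assuming Linux 4.5 or later and a glibc new enough to define MADV_FREE; the file name madv_free_demo.c is made up) that marks a private anonymous mapping MADV_FREE and then touches it again:

madv_free_demo.c
/*
 * Mark a private anonymous mapping MADV_FREE and read it back.
 * Needs Linux >= 4.5 and a glibc that defines MADV_FREE (2.24+).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 4096;
    /* MADV_FREE only works on private anonymous mappings. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(p, 0xaa, len);                 /* dirty the pages */

    /* Tell the kernel it may reclaim these pages lazily. */
    if (madvise(p, len, MADV_FREE) != 0) {
        perror("madvise(MADV_FREE)");
        return 1;
    }

    /* Until the pages are actually reclaimed this still prints 0xaa;
     * once they have been reclaimed it prints 0x00 instead. */
    printf("first byte after MADV_FREE: 0x%02x\n", (unsigned char)p[0]);

    p[0] = 0x55;   /* a new write cancels the pending free for this page */

    munmap(p, len);
    return 0;
}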

There is a tool called linux-ftools that can show which files are resident in the page cache and evict individual files from it. It is built on this same family of "advice" interfaces (for regular files that means mincore(2) to check residency and posix_fadvise(2) rather than madvise(2) to evict, since madvise only covers the caller's own mappings).
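A rough sketch of how such a tool can be put together (this is not linux-ftools itself; the program and file names are made up): mincore(2) reports which pages of a mapped file are resident in the page cache, and posix_fadvise(2) with POSIX_FADV_DONTNEED asks the kernel to drop that file's clean pages only.

fincore_sketch.c
/*
 * Report how much of FILE is in the page cache (mincore) and then ask
 * the kernel to drop that file's clean pages (posix_fadvise).
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0) {
        perror(argv[1]);
        return 1;
    }

    long pagesize = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesize - 1) / pagesize;

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char *vec = malloc(npages);
    if (map == MAP_FAILED || !vec || mincore(map, st.st_size, vec) != 0) {
        perror("mmap/mincore");
        return 1;
    }

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        if (vec[i] & 1)               /* bit 0: page is in the page cache */
            resident++;
    printf("%s: %zu / %zu pages resident\n", argv[1], resident, npages);

    munmap(map, st.st_size);   /* unmap first: mapped pages are not dropped */
    free(vec);

    /* Per-file version of what drop_caches does globally; clean pages only. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}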

O_DIRECT

As mentioned earlier, O_DIRECT can be passed to open(2) to bypass the page cache. However, another process can still open the same file without O_DIRECT, or mmap(2) it, so the page cache never disappears entirely. Depending on the filesystem implementation, every O_DIRECT read/write then ends up invalidating the overlapping page cache, doing the direct read/write itself, and letting the buffered users repopulate the cache afterwards. As you can see at a glance, this is enormously wasteful.

Since you usually have no idea what the other processes on a system are doing, the situations where O_DIRECT is genuinely useful are quite limited. I suspect you can only really make it work on a system that is more or less dedicated to the process doing the O_DIRECT reads and writes.
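For reference, here is a minimal sketch of an O_DIRECT write (my own example; the file name testfile is made up). The point to note is the alignment requirement: the buffer, offset and length generally have to be aligned to the logical block size, hence posix_memalign() instead of malloc().

odirect_write.c
/*
 * A page-cache-bypassing write with O_DIRECT.  The buffer, offset and
 * length must be suitably aligned (here: 4KiB), hence posix_memalign().
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;  /* assume a 4KiB logical block size */
    const size_t len   = 4096;
    void *buf;

    if (posix_memalign(&buf, align, len) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 'x', len);

    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    /* Goes to the device without being cached (the filesystem still has
     * to invalidate any overlapping cached pages first). */
    if (write(fd, buf, len) != (ssize_t)len)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}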

Afterword

This turned into a rambling piece with so many digressions that even I am not sure what I was writing, sorry about that. I hope it at least conveys how a single write to drop_caches drags you from the page cache down into the deep, murky swamp of mm and the filesystems.

As the source reading shows, a write to drop_caches is a relatively "heavy" operation, and there are plenty of places where it bails out rather than wait when there is contention, so one write does not necessarily reclaim everything in a single sweep. Reclaiming something that is still in use would be a disaster, whereas missing something that is unused does little harm, so that trade-off seems fair enough to me.

Coming back to where we started: I believe there is almost never a reason to touch drop_caches other than for debugging or benchmarking. Despite that, there are plenty of articles describing how to write to drop_caches, and I worry they have become a source of confusion for beginners.
