More than 5 years have passed since last update.

ルームクリップ株式会社

PostgreSQLのハッシュテーブルの実装を読む

PostgreSQL

Posted at 2019-01-26

環境

PostgreSQL 10.5

TL;DR

リニアハッシュテーブルを使っている。
リニアハッシュテーブル自体には並列性への考慮はないため、少なくとも1つの（場合によっては複数の）mutexを用意している。
Flyweightパターンを用いて高速化している。

はじめに

PostgreSQLではデータ構造としてハッシュテーブルは非常によく使われる。例えば、共有メモリで言うと、バッファ領域を管理するShared Buffer Lookup Tableやロック情報を保持するLOCK hashなどがある。バックエンドプロセスがローカルに保持するテーブルで言うと、LOCALLOCK hashやRelationをキャッシュするRelcache by OIDなどがある。以下ではこれの実装を見てみる。

リニアハッシュテーブルを使っている。

リニアハッシュって何やねん？という人もいると思う。以下の資料が良いと思う。

一番オススメの資料はPostgreSQLの実装の解説まであるPostgreSQL Internal： PostgreSQL Internal：ハッシュテーブルハッシュテーブルだ。これは関数の解説まである。リニアハッシュとは？の部分も丁寧に解説してあって、とても良い。

リニアハッシュは主題ではないが、Split-ordered linked list: lock free hash tableも良い。このスライドではリニアハッシュだけでなく、種々のハッシュテーブルについての説明がある。

詳細は上記の資料を見ていただきたいが、簡単に説明するとハッシュテーブルの要素数が増えてきたときに、bucketを動的に増やしたいが、その際に要素をどこのbucketに置くかということを再計算する必要がある。その計算量を少なくするためにリニアハッシュを使っている。

リニアハッシュ周りのソースは(your_path)/src/backend/utils/hash/dynahash.cにある。

基本的な構造体を以下に列挙して置く。

/*
 * HASHELEMENT is the private part of a hashtable entry.  The caller's data
 * follows the HASHELEMENT structure (on a MAXALIGN'd boundary).  The hash key
 * is expected to be at the start of the caller's hash entry data structure.
 */
typedef struct HASHELEMENT
{
	struct HASHELEMENT *link;	/* link to next entry in same bucket */
	uint32		hashvalue;		/* hash function result for this entry */
} HASHELEMENT;

コメントも合わせて見ていただきたいが、メモリ上ではHASHELEMENT構造体に続いて、keyとkey以外のデータが続く形になる。

/* A hash bucket is a linked list of HASHELEMENTs */
typedef HASHELEMENT *HASHBUCKET;

これがbucketである。

/* A hash segment is an array of bucket headers */
typedef HASHBUCKET *HASHSEGMENT;

これがsegmentである。

/*
 * Top control structure for a hashtable --- in a shared table, each backend
 * has its own copy (OK since no fields change at runtime)
 */
struct HTAB
{
	HASHHDR    *hctl;			/* => shared control information */
	HASHSEGMENT *dir;			/* directory of segment starts */
	HashValueFunc hash;			/* hash function */
	HashCompareFunc match;		/* key comparison function */
	HashCopyFunc keycopy;		/* key copying function */
	HashAllocFunc alloc;		/* memory allocator */
	MemoryContext hcxt;			/* memory context if default allocator used */
	char	   *tabname;		/* table name (for error messages) */
	bool		isshared;		/* true if table is in shared memory */
	bool		isfixed;		/* if true, don't enlarge */

	/* freezing a shared table isn't allowed, so we can keep state here */
	bool		frozen;			/* true = no more inserts allowed */

	/* We keep local copies of these fixed values to reduce contention */
	Size		keysize;		/* hash key length in bytes */
	long		ssize;			/* segment size --- must be power of 2 */
	int			sshift;			/* segment shift = log2(ssize) */
};

directoryはHTAB構造体のメンバである。

データを探すときには、まずどのbucketに該当の要素があるのかを計算する。

例えば、keyを探して、何らかのaction（HASH_FINDとかHASH_REMOVEとか。どんなactionがあるかはHASHACTIONというenumを見れば良い）を行うhash_search()という関数に呼ばれるhash_search_with_hash_value()には以下のようなコードがある。

/*
 * Do the initial lookup
 */
// hash値からbucket番号を得る
bucket = calc_bucket(hctl, hashvalue);

// bucketが含まれるsegmentの位置
segment_num = bucket >> hashp->sshift;
// bucketのsegment内でのindex
segment_ndx = MOD(bucket, hashp->ssize);

// segmentの先頭
segp = hashp->dir[segment_num];

if (segp == NULL)
	hash_corrupted(hashp);

// bucketが含まれるsegmentへのポインタ
prevBucketPtr = &segp[segment_ndx];
// 初期値は先頭の要素
currBucket = *prevBucketPtr;

hashvalueとhashpは引数で与えられ、それぞれunsigned intとHTABである。calc_bucket()はbucketの番号を返す。また、PostgreSQLではsegmentのsizeは必ず$2^n$なので、segmentの位置(segment_num)とbucketのsegment内でのindex(segment_ndx)がbit演算を用いて高速に計算可能。

次に該当の要素に行くため、連結リストをたどっていく。

while (currBucket != NULL)
	{
		if (currBucket->hashvalue == hashvalue &&
			match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
			break;
		prevBucketPtr = &(currBucket->link);
		currBucket = *prevBucketPtr;
#if HASH_STATISTICS
		hash_collisions++;
		hctl->collisions++;
#endif
	}

最後にactionを実行する。

switch (action)
	{
		case HASH_FIND:
			if (currBucket != NULL)
				return (void *) ELEMENTKEY(currBucket);
			return NULL;

		case HASH_REMOVE:
			(略)

		case HASH_ENTER_NULL:
			(略)

		case HASH_ENTER:
			(略)

	}

ただし、ELEMENTKEY()はポインタからkeyの位置を計算するマクロである。

少なくとも1つの（場合によっては複数の）mutexを用意している

もう一度HTAB構造体を再掲する。

/*
 * Top control structure for a hashtable --- in a shared table, each backend
 * has its own copy (OK since no fields change at runtime)
 */
struct HTAB
{
	HASHHDR    *hctl;			/* => shared control information */
	HASHSEGMENT *dir;			/* directory of segment starts */
	HashValueFunc hash;			/* hash function */
	HashCompareFunc match;		/* key comparison function */
	HashCopyFunc keycopy;		/* key copying function */
	HashAllocFunc alloc;		/* memory allocator */
	MemoryContext hcxt;			/* memory context if default allocator used */
	char	   *tabname;		/* table name (for error messages) */
	bool		isshared;		/* true if table is in shared memory */
	bool		isfixed;		/* if true, don't enlarge */

	/* freezing a shared table isn't allowed, so we can keep state here */
	bool		frozen;			/* true = no more inserts allowed */

	/* We keep local copies of these fixed values to reduce contention */
	Size		keysize;		/* hash key length in bytes */
	long		ssize;			/* segment size --- must be power of 2 */
	int			sshift;			/* segment shift = log2(ssize) */
};

一番初めのメンバの型はHASHHDRのポインタ型となっている。HASHHDRがどういう定義かというと、以下の構造体である。

struct HASHHDR
{
	/*
	 * The freelist can become a point of contention in high-concurrency hash
	 * tables, so we use an array of freelists, each with its own mutex and
	 * nentries count, instead of just a single one.  Although the freelists
	 * normally operate independently, we will scavenge entries from freelists
	 * other than a hashcode's default freelist when necessary.
	 *
	 * If the hash table is not partitioned, only freeList[0] is used and its
	 * spinlock is not used at all; callers' locking is assumed sufficient.
	 */
	// NUM_FREELISTSはマクロで定義されており、32
	FreeListData freeList[NUM_FREELISTS];

	/* These fields can change, but not in a partitioned table */
	/* Also, dsize can't change in a shared table, even if unpartitioned */
	long		dsize;			/* directory size */
	long		nsegs;			/* number of allocated segments (<= dsize) */
	uint32		max_bucket;		/* ID of maximum bucket in use */
	uint32		high_mask;		/* mask to modulo into entire table */
	uint32		low_mask;		/* mask to modulo into lower half of table */

	/* These fields are fixed at hashtable creation */
	Size		keysize;		/* hash key length in bytes */
	Size		entrysize;		/* total user element size in bytes */
	long		num_partitions; /* # partitions (must be power of 2), or 0 */
	long		ffactor;		/* target fill factor */
	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
	long		ssize;			/* segment size --- must be power of 2 */
	int			sshift;			/* segment shift = log2(ssize) */
	int			nelem_alloc;	/* number of entries to allocate at once */
};

FreeListDataがmutexを保持していて、以下のような定義となっている。

typedef struct
{
	slock_t		mutex;			/* spinlock for this freelist */
	long		nentries;		/* number of entries in associated buckets */
	HASHELEMENT *freeList;		/* chain of free elements */
} FreeListData;

そもそも何でこれが必要なのかというと次のような背景がある。ハッシュテーブルに操作する際、ロックをしたいことがあるが、ハッシュテーブルに対してmutexを一つだけ用意すると複数のプロセスが同時に操作するときにボトルネックとなる。これを回避するのがpartitioned lockingと呼ばれているアプローチで、テーブルの部分集合に対応するmutexを複数用意する。つまり、全てのハッシュ値をいくつかの集合に分けて、その集合に対してmutexを用意する。そうすると、一つのハッシュテーブルを複数のプロセスが同時に操作することが可能となる。

ハッシュテーブルをpartitioned modeで作るためにはハッシュテーブルを作るときに、partitionするよというビットを立てて、hash_create()を呼ぶ。
partitioned modeでないハッシュテーブルもfreeList[0]を使うので、全てのハッシュテーブルは対応するfreeListを持っていることになる。

例えば、以下のように使われる。

hash_search_with_hash_value()では、まず引数で渡されるhashvalueに対応するfreeListのindexを計算する。これはFREELIST_IDX()というマクロを使って計算される。

そして引数のactionにHASH_REMOVEを渡すと、以下の部分が実行される。

if (currBucket != NULL)
			{
				/* if partitioned, must lock to touch nentries and freeList */
				if (IS_PARTITIONED(hctl))
					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));

				/* delete the record from the appropriate nentries counter. */
				Assert(hctl->freeList[freelist_idx].nentries > 0);
				hctl->freeList[freelist_idx].nentries--;

				/* remove record from hash bucket's chain. */
				*prevBucketPtr = currBucket->link;

				/* add the record to the appropriate freelist. */
				currBucket->link = hctl->freeList[freelist_idx].freeList;
				hctl->freeList[freelist_idx].freeList = currBucket;

				if (IS_PARTITIONED(hctl))
					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);

				/*
				 * better hope the caller is synchronizing access to this
				 * element, because someone else is going to reuse it the next
				 * time something is added to the table
				 */
				return (void *) ELEMENTKEY(currBucket);
			}
			return NULL;

コードを見れば分かるように、hashvalueに対応するfreeListを使って、spin lockを取得している。そして、spin lockを取得した後、3つの操作をしている。最初に、freeListDataのnentriesをデクリメントしている。nentriesはfreeListの要素数ではなく、そのfreeListが対応しているハッシュ値のうち、生きている（つまりなんらかのbucketにつながっているということ）要素数を意味する。よって、上記のコードはHASH_REMOVEのときに通るパスなので、生きている要素は一つ減るはずである。よって、HASH_REMOVEをデクリメントする。次にあるのは、*prevBucketPtr = currBucket->link;だが、これは連結リストの一つの要素を削除している。最後に削除した要素をfreeListに加えている。これについては次のセクションで説明する。

Flyweightパターンを用いて高速化している。

要素を削除するときに、freeListに追加していることを上で見た。結論から言うと、これはFlyweightパターンを使っているためだ。Flyweightパターンは一つのオブジェクト（今だとC言語なので、構造体）を使用した後、廃棄せずに保持しておき、どこかに置いておく。そして、必要になったときに保持していたオブジェクトを使う、というものだ。

PostgreSQLのハッシュテーブルでは、hash_create()でハッシュテーブルを作ったときに、element_alloc()を少なくとも1回は呼ぶ（partitioned modeなら複数回呼ぶことになる）。element_alloc()は要素のためのメモリをallocateするのだが、このときにfreeListにそれらを保持する。そして、ハッシュテーブルに要素を割り当てるときには、要素はfreeListからget_hash_entry()を使って得る。そして、削除するときには上記でも見た通り、対応するfreeListに加える。

element_alloc()の定義は以下の通り。hashp->alloc(nelem * elementSize)でメモリを割り当てて、このハッシュテーブルのentryのsizeに応じて、ポインタをきちんと設定してからfreeListに割り当てている。

static bool
element_alloc(HTAB *hashp, int nelem, int freelist_idx)
{
	HASHHDR    *hctl = hashp->hctl;
	Size		elementSize;
	HASHELEMENT *firstElement;
	HASHELEMENT *tmpElement;
	HASHELEMENT *prevElement;
	int			i;

	if (hashp->isfixed)
		return false;

	/* Each element has a HASHELEMENT header plus user data. */
	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);

	CurrentDynaHashCxt = hashp->hcxt;
	// ここでメモリを割り当てている
	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);

	if (!firstElement)
		return false;

	/* prepare to link all the new entries into the freelist */
	prevElement = NULL;
	tmpElement = firstElement;
	for (i = 0; i < nelem; i++)
	{
		tmpElement->link = prevElement;
		prevElement = tmpElement;
		tmpElement = (HASHELEMENT *) (((char *) tmpElement) + elementSize);
	}

	/* if partitioned, must lock to touch freeList */
	if (IS_PARTITIONED(hctl))
		SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);

	/* freelist could be nonempty if two backends did this concurrently */
	// ここでfreeListにつなげている
	firstElement->link = hctl->freeList[freelist_idx].freeList;
	hctl->freeList[freelist_idx].freeList = prevElement;

	if (IS_PARTITIONED(hctl))
		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);

	return true;
}

まとめ

PostgreSQLのハッシュテーブルの実装上の工夫を見てきた。見る人によって、気づく点は違うかもしれないが、僕は以下の3点で十分まとめきれていると思っている。

リニアハッシュテーブルを使っている。
リニアハッシュテーブル自体には並列性への考慮はないため、少なくとも1つの（場合によっては複数の）mutexを用意している。
Flyweightパターンを用いて高速化している。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up