More than 5 years have passed since last update.

vlocks for Bare-Metal Mutual Exclusion(2/2)

Posted at 2020-05-16

vlocks for Bare-Metal Mutual Exclusion

ARM implementation

The current ARM implementation [2] contains some optimisations beyond the basic algorithm:

現在のARMでの実装[2]には、基本的なアルゴリズム以外にもいくつかの最適化が含まれています。

・By packing the members of the currently_voting array close together, we can read the whole array in one transaction (providing the number of CPUs potentially contending the lock is small enough). This reduces the number of round-trips required to external memory.

・currently_voting配列のmember変数をpackすることで、1つのトランザクションで配列全体を読み取ることができます（ロックを競合する可能性のあるCPUの数が十分に少ない場合）。これにより、外部メモリで必要なround-tripsの数を削減できます。

In the ARM implementation, this means that we can use a single load and comparison:

ARM実装では、単一のロードと比較で使用できることを意味します。

LDR     Rt, [Rn]　　// Rtに、Rn番地の内容32bitを読み出す
CMP     Rt, #0      // Rtが0と一致するかを判定する

…in place of code equivalent to:

LDRB    Rt, [Rn]     // Rtに、Rn番地の内容8bitを読み出す
CMP     Rt, #0       // Rtが0と一致するかを判定する
LDRBEQ  Rt, [Rn, #1] // Rtに、Rn+1番地の内容8bitを読み出す
CMPEQ   Rt, #0       // Rtが0と一致するかを判定する
LDRBEQ  Rt, [Rn, #2] // Rtに、Rn+2番地の内容8bitを読み出す
CMPEQ   Rt, #0       // Rtが0と一致するかを判定する
LDRBEQ  Rt, [Rn, #3] // Rtに、Rn+3番地の内容8bitを読み出す
CMPEQ   Rt, #0       // Rtが0と一致するかを判定する

This cuts down on the fast-path latency, as well as potentially reducing bus contention in contended cases.

これにより、fast-pathのレイテンシーを削減でき、競合するケースでの、潜在的なバス競合を減らす事ができます。

The optimisation relies on the fact that the ARM memory system guarantees coherency between overlapping memory accesses of different sizes, similarly to many other architectures. Note that we do not care which element of currently_voting appears in which bits of Rt, so there is no need to worry about endianness in this optimisation.

最適化は、ARM memory systemが異なるサイズで重複するメモリアクセス間のcoherencyを保証するという事実に依存しています。他の多くのアーキテクチャと同様に。 Rtのどのビットにcurrent_votingのどの要素が現れるかは関係ないので、この最適化でエンディアンについて心配する必要はない。

If there are too many CPUs to read the currently_voting array in one transaction then multiple transations are still required. The implementation uses a simple loop of word-sized loads for this case. The number of transactions is still fewer than would be required if bytes were loaded individually.

CPUの数が非常に多く、currently_voting 配列を1回のトランザクションで読み取れないならば、複数回のトランザクションが必要になります。この実装では、word-sizeのloadを単純に繰り返します。トランザクションの数は、byte単位で個別に読み出したのに比べれば、少なくなります。

In principle, we could aggregate further by using LDRD or LDM, but to keep the code simple this was not attempted in the initial implementation.

根本的には、LDRDやLDMを使う事で、更に集約する事ができました。しかし、初期実装においては、コードを簡潔にするために採用されませんでした。

・vlocks are currently only used to coordinate between CPUs which are unable to enable their caches yet. This means that the implementation removes many of the barriers which would be required when executing the algorithm in cached memory.

・vlockは現在、まだキャッシュを有効にできないCPU間の調整にのみ使用されています。キャッシュされたメモリでアルゴリズムを実行するには、実装では多くの障壁を取り除く必要があることを意味します。

packing of the currently_voting array does not work with cached memory unless all CPUs contending the lock are cache-coherent, due to cache writebacks from one CPU clobbering values written by other CPUs. (Though if all the CPUs are cache-coherent, you should be probably be using proper spinlocks instead anyway).

他のCPUから書き込まれた値を別のCPUからキャッシュに書き戻すことに起因して、すべてのCPUでのlockが、cache-coherentで競合しない限り、currently_voting配列のpackingはキャッシュされたメモリでは動作できません。（全てのCPUがcache-coherentである場合でも、適切なスピンロックを使用しているはずである）。

・The “no votes yet” value used for the last_vote variable is 0 (not -1 as in the pseudocode). This allows statically-allocated vlocks to be implicitly initialised to an unlocked state simply by putting them in .bss.

last_vote 変数に対して利用される、"投票なし(no votes yet)"の値は0です（疑似コードのように-1ではありません）。これにより、静的に配置されたvlockによって, .bssに置かれただけで暗黙的にunlockされた初期状態にします。

An offset is added to each CPU’s ID for the purpose of setting this variable, so that no CPU uses the value 0 for its ID.

オフセットが、この変数を設定するために、各CPUのIDに追加されるため、CPUはIDに値0を使用しません。

Colophon

Originally created and documented by Dave Martin for Linaro Limited, for use in ARM-based big.LITTLE platforms, with review and input gratefully received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for grabbing most of this text out of the relevant mail thread and writing up the pseudocode.

References

[1] Lamport, L. “A New Solution of Dijkstra’s Concurrent Programming Problem”, Communications of the ACM 17, 8 (August 1974), 453-455. https://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

[2] linux/arch/arm/common/vlock.S, www.kernel.org.

もともと、Linux Kernelのソースコードの一部なので、GPLv2扱いになる（はずの認識）。

https://www.kernel.org/doc/html/latest/index.html

Licensing documentation

The following describes the license of the Linux kernel source code (GPLv2), how to properly mark the license of individual files in the source tree, as well as links to the full license text.

https://www.kernel.org/doc/html/latest/process/license-rules.html#kernel-licensing

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up