More than 5 years have passed since last update.

GitのPacking Heuristics

Posted at 2016-12-05

Gitについて勉強していると、GitのDelta compressionがどのような仕組みになっているのかが気になってきます。git.gitにはpack-heuristics.txtという、GitがどうやってオブジェクトをPACKファイルに詰め込んでいるのかについての、IRCからの抜粋があります。

GitにおけるPackingの問題とは

Gitはファイルの各リビジョンを独立したコピーとして持つのではなく、差分として持つことで、ディスクの利用効率をあげています。しかしこれはGitが内部でpatchファイルを持っているということではなく、PACK formatという独自フォーマットによって達成されています。このPACK formatはローカルでのデータ保持以外にも、dumb HTTP protocolを除く各種Gitプロトコルや、git-bundleでも使われています。

PACK formatはGitの4種類のオブジェクトをいい感じにconcatしたフォーマットです。内部では6種類のオブジェクトが存在します。

OBJ_COMMIT    = 1
OBJ_TREE      = 2
OBJ_BLOB      = 3
OBJ_TAG       = 4
OBJ_OFS_DELTA = 6
OBJ_REF_DELTA = 7

OBJ_OFS_DELTAとOBJ_REF_DELTAが差分化されたオブジェクトです。この差分による圧縮をDelta compressionと呼びます。

Delta compressionのフォーマット自体は簡単で、patch-delta.cを眺めれば、コピーと埋め込みリテラルの繰り返しであることがわかります。Delta compressionのアルゴリズム自体はdiff-delta.cを見る限り、Rabin fingerprintをつかって適当にインデックスを作って、それを元に一致長が長そうな部分を探す、といった感じです。

このPACK formatには次のような問題が考えられます。

どのようにすれば、より効率がよい(より速い or より圧縮効率のよい) delta_compression(obj1, obj2)が作れるか。
delta_compression(obj1, obj2)が与えられたときに、どのようにすれば圧縮率の高いobj1とobj2を探せるか。
どのような順番でオブジェクトをPACKファイルに詰め込めばI/O効率が良くなるのか。

最初の効率のよいdelta_compressionについては、文章の類似度判定と似ている感じがあって、Similarity preserving hash functionあたりで検索すると、似たようなアプローチを探すことができます。pack-heuristics.txtでは、残り2つについてGitがどのようなアプローチを採っているのかについて説明しています。

圧縮率の高い`obj1`と`obj2`を探す

ブルートフォースでO(N^2)なんてできないので、なんとかインチキする必要があります。この部分について、IRCでは次のように述べられています。

<linus> Remember: Git really doesn't follow files. So what it does is

    - generate a list of all objects
    - sort the list according to magic heuristics
    - walk the list, using a sliding window, seeing if an object
      can be diffed against another object in the window
    - write out the list in recency order

つまり、謎の順序でソートした後に、その後に続く数ファイル内でのみ比較して圧縮しているということになります。この謎の順序は次のように説明されています。

<gitster> The quote from the above linus should be rewritten a

    bit (wait for it):
    - first sort by type.  Different objects never delta with
  each other.
    - then sort by filename/dirname.  hash of the basename
      occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
      DIR_BITS are for the hash of leading path elements.
    - then if we are doing "thin" pack, the objects we are _not_
      going to pack but we know about are sorted earlier than
      other objects.
    - and finally sort by size, larger to smaller.

中略

<gitster> That's the sort order.  What this means is:

    - we do not delta different object types.
- we prefer to delta the objects with the same full path, but
      allow files with the same name from different directories.
- we always prefer to delta against objects we are not going
      to send, if there are some.
- we prefer to delta against larger objects, so that we have
      lots of removals.

    The penultimate rule is for "thin" packs.  It is used when

    the other side is known to have such objects.

まず、Gitのオブジェクトの種類で分けます。どうせ違う種類のオブジェクトの類似度が高いという可能性は低いため、これによって、違う種類のオブジェクト同士でDelta compressionをしないようにします。次に、basename/dirnameで比較します。これは面白いのですが、basenameを先に持ってくることで、ファイルの移動を行ったとしても、うまい感じに移動前後のファイルで差分が取られます。Thin packの部分はGitプロトコル特有の部分なので省略。最後にファイルサイズの大きい方から小さい方にソートしていきます。このファイルサイズソートによって、大きいファイルをベースに小さいファイルのdeltaを計算することになります。これは、あとでLinusが説明するのですが、Delta compressionのフォーマット上、バイト列を削除するほうが追加するよりも効率的だからです。

このように、Gitではオブジェクトの種類やファイルパスでうまいところ類似のオブジェクトのあたりをつけているのですが、なんとなくこれだとファイルのリネームに弱いような感じがします……

効率のよいオブジェクトの順番

<linus> The recency ordering (which is basically: put objects

    _physically_ into the pack in the order that they are
    "reachable" from the head) is important.

<njs`> okay

<linus> It's important because that's the thing that gives packs

    good locality. It keeps the objects close to the head (whether
    they are old or new, but they are _reachable_ from the head)
    at the head of the pack. So packs actually have absolutely
    _wonderful_ IO patterns.

PACKファイルにおいて、オブジェクトがどのような順番で格納されるのかというのは重要になってきます。ここで言うRecency orderingというのは、HEADから到達可能なオブジェクトをたどっていったときの順番になります。こうすることによって、checkoutするときに見るべきオブジェクトが一部分に集中するため、cache localityが良くなります。

IRCでnjsは次のような疑問を抱きます。

<njs`> And then once we have picked a delta or fulltext to

    represent each object, we re-sort by recency, and write them
    out in that order.

<linus> Yup. Some other small details:

And of course there is the "Other Shoe" Factor too.

<linus> - We limit the delta depth to another magic value (right

    now both the window and delta depth magic values are just "10")

<njs`> Hrm, my intuition is that you'd end up with really _bad_ IO

    patterns, because the things you want are near by, but to
    actually reconstruct them you may have to jump all over in
    random ways.

<linus> - When we write out a delta, and we haven't yet written

    out the object it is a delta against, we write out the base
    object first.  And no, when we reconstruct them, we actually
    get nice IO patterns, because:
    - larger objects tend to be "more recent" (Linus' law: files grow)
    - we actively try to generate deltas from a larger object to a
      smaller one
    - this means that the top-of-tree very seldom has deltas
      (i.e. deltas in _practice_ are "backwards deltas")

IRCにおけるnjsの疑問は、「deltifyしたあとにrecencyでソートし直すと、たとえほしいオブジェクトが近くにあっても、そのオブジェクトをdeltaから再構築するために遠いオブジェクトを参照しなければいけないのであれば、I/Oパターン最悪なのでは」というものです。

これに対してLinusは、大体の場合Recency orderingとDelta orderingは同じなのだよ、ということを述べています。"Linus' law: files grow"とあるように、ソフトウェアは基本的に増加する方向に成長していきます。ファイルサイズは肥大化していくわけです。これにより、ファイルサイズでソートされるDelta orderingは、HEADからの到達可能順序でソートされるRecency orderingと多くの場合一致するということになります。つまりPACKファイル内部では、古いファイルに足し算される形で差分が保存されるのではなく、新しいファイルから引き算される形で古いファイルが保存されているわけです("backwards deltas")。

この"backwards deltas"はなかなか示唆的で、よりHEADに近いリビジョンというのはあまりDelta uncompressionをしなくてもよく、あまり多くのオブジェクトを参照しなくても良いため、より効率的にcheckoutできる感じがします。古いリビジョンよりもHEADに近い新しいリビジョンをcheckoutすることが多そうであることを考えても、いろいろな意味で効率が良さそうです。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

GitのPacking Heuristics

GitにおけるPackingの問題とは

圧縮率の高いobj1とobj2を探す

効率のよいオブジェクトの順番

圧縮率の高い`obj1`と`obj2`を探す