
Making git diff treat a change with less than 50% similarity as a rename

Posted at 2016-05-21 (last updated at 2016-05-21)

Conclusion

Read

$ man git-diff

and it tells you to run

$ git diff -M[percentage]

to set the threshold.

Experiment

Prepare the starting commit

Create a samplefile with the following content and commit it
(the numbers 1 through 10, one per line).

samplefile

$ git add samplefile
$ git commit -m 'first-commit'

Prepare the follow-up commit

Create a samplefile2 with the following content
(the numbers 1 through 5, one per line).

samplefile2

$ git rm samplefile
$ git add samplefile2
$ git commit -m 'second-commit'
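The two commits above can be reproduced in a throwaway repository with a sketch like the following. The temporary directory and the placeholder identity ("example") are assumptions made only so the script runs on a machine without a global Git configuration; the file contents come from seq.

```shell
# Build a scratch repository containing the two commits described above.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity (an assumption) so that commits succeed anywhere.
git config user.name "example"
git config user.email "example@example.com"

# first-commit: samplefile holds 1..10, one number per line.
seq 1 10 > samplefile
git add samplefile
git commit -q -m 'first-commit'

# second-commit: samplefile is removed, samplefile2 holds 1..5.
seq 1 5 > samplefile2
git rm -q samplefile
git add samplefile2
git commit -q -m 'second-commit'

git log --oneline
```

After running it, the current directory is the scratch repository, so the git diff commands in the next section can be tried directly.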

Take the diff

A plain diff treats them as two separate files.

$ git diff HEAD^

diff --git a/samplefile b/samplefile
deleted file mode 100644
index f00c965..0000000
--- a/samplefile
+++ /dev/null
@@ -1,10 +0,0 @@
-1
-2
-3
-4
-5
-6
-7
-8
-9
-10
diff --git a/samplefile2 b/samplefile2
new file mode 100644
index 0000000..8a1218a
--- /dev/null
+++ b/samplefile2
@@ -0,0 +1,5 @@
+1
+2
+3
+4
+5

Specify -M40

Lower the threshold to 40% similarity.

$ git diff -M40 HEAD^

diff --git a/samplefile b/samplefile2
similarity index 47%
rename from samplefile
rename to samplefile2
index f00c965..8a1218a 100644
--- a/samplefile
+++ b/samplefile2
@@ -3,8 +3,3 @@
 3
 4
 5
-6
-7
-8
-9
-10

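Now Git treats the change as a rename of a single file.
The similarity is 47%, so any threshold of 47% or lower (git diff -M47 and below) gives the same result. Note that Git reads the number after -M as a decimal fraction, so a single digit such as -M4 means 40%, not 4%.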
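Where the 47% comes from can be checked with a little arithmetic. The sketch below assumes a simplified model of Git's score: the number of bytes of the old file that survive in the new file, divided by the size of the larger file, rounded down. Here every byte of samplefile2 also appears in samplefile, so the surviving byte count is just the size of samplefile2. The variable names are made up for this illustration.

```shell
# Simplified model (an assumption, not Git's actual code):
#   similarity = bytes kept from the old file / size of the larger file
src_size=$(seq 1 10 | wc -c | tr -d ' ')   # samplefile:  21 bytes
dst_size=$(seq 1 5 | wc -c | tr -d ' ')    # samplefile2: 10 bytes
kept=$dst_size                             # all of samplefile2 is a prefix of samplefile
larger=$(( src_size > dst_size ? src_size : dst_size ))
similarity=$(( kept * 100 / larger ))      # integer division rounds down
echo "similarity: ${similarity}%"
```

With 10 bytes kept out of 21, the integer division yields 47, which matches the similarity index line in the diff output.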

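Is it the same if you git mv first and then edit the file?

git mv → edit the file → git add
edit the file → git add → git mv

Both of these give the same result.

In the end, what Git manages is a snapshot at each point in time; it does not record how the difference came about.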
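One way to check this is to perform the two orderings in two separate scratch repositories and compare the tree hash of the resulting commit. The sketch below does that; the helper name make_repo, its "mv-first" / "edit-first" labels, and the placeholder identity are assumptions introduced only for this illustration.

```shell
set -eu

# make_repo ORDER: build a scratch repo, apply the change in the given
# order ("mv-first" or "edit-first"), and print the tree hash of HEAD.
make_repo() {
  dir=$(mktemp -d)
  (
    cd "$dir"
    git init -q
    git config user.name "example"
    git config user.email "example@example.com"
    seq 1 10 > samplefile
    git add samplefile
    git commit -q -m 'first-commit'
    if [ "$1" = "mv-first" ]; then
      # git mv -> edit the file -> git add
      git mv samplefile samplefile2
      seq 1 5 > samplefile2
      git add samplefile2
    else
      # edit the file -> git add -> git mv
      seq 1 5 > samplefile
      git add samplefile
      git mv samplefile samplefile2
    fi
    git commit -q -m 'second-commit'
  ) >&2
  git -C "$dir" rev-parse 'HEAD^{tree}'
}

tree_a=$(make_repo mv-first)
tree_b=$(make_repo edit-first)
echo "mv-first:   $tree_a"
echo "edit-first: $tree_b"
```

Both lines should show the same hash: the tree only describes which files exist and what they contain, so the order of the commands leaves no trace in it.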

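How is it implemented?

Searching on GitHub, the following places look like the key spots.

The threshold is decided from the option or the default value

The default value is 50%

Candidates are sorted by score

Reading on from around here gives a rough picture of how it works.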

Aside

This only applies to Unix-like systems, but the seq command is handy to know for this experiment.

$ seq 3 6
3
4
5
6

When run with two arguments like this, it generates
the sequence from the first argument to the second, one number per line.

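The -C option

Incidentally, there is also a -C option (--find-copies), which offers functionality similar to -M but detects copies as well as renames. Combined with the separate --find-copies-harder flag, it considers every file in the tree, including unmodified ones, as a possible copy source.

(Excerpt from man git-diff, describing --find-copies-harder)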

For performance reasons, by default, -C option finds copies only if the original file of the copy was modified in the same changeset. This flag makes the command inspect unmodified files as candidates for the source of copy. This is a very expensive operation for large projects, so use it with caution. Giving more than one -C option has the same effect.
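The difference described in the excerpt can be observed with a small experiment. The sketch below, which uses made-up file names (original, copy) and a placeholder identity, commits a file, then adds an unmodified copy of it in a second commit and compares the two diff modes.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity (an assumption) so that commits succeed anywhere.
git config user.name "example"
git config user.email "example@example.com"

seq 1 10 > original
git add original
git commit -q -m 'add original'

# The second commit adds a copy but leaves the source file untouched.
cp original copy
git add copy
git commit -q -m 'add copy'

# -C alone only considers files modified in the same change as copy sources.
plain=$(git diff -C --name-status HEAD^)
# --find-copies-harder also considers unmodified files.
harder=$(git diff -C --find-copies-harder --name-status HEAD^)
echo "-C only:                 $plain"
echo "-C --find-copies-harder: $harder"
```

With -C alone the new file should be reported as a plain addition (status A), because original was not modified in that commit. With --find-copies-harder it should be reported as a copy of original (status C).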

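A personal impression, or rather a hypothesis

I think this also explains why there is a git mv but no git cp.

Detecting rename + modify only requires comparing against the files that are being deleted, so the number of comparisons depends on the number of deleted files. Detecting copy + modify, on the other hand, has to treat every file in the working tree as a potential copy source. In a huge project that becomes a very large number of comparisons, which is presumably why Git does not want to offer it as part of the ordinary git diff.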
