More than 1 year has passed since last update.

Goでdiffコマンド書いて実際の実装と比べてみた

Posted at 2022-12-05

はじめに

こんにちは。にしやまです。
この記事はTechCommit AdventCalendar2022の6日目の記事です。
昨日はTomoさんの【振り返り】2022年・週1回の輪読会を継続して良かった点・改善点でした。

概要

アドカレのネタどうしようと考えた結果、Goでdiffコマンドを実装してみようと思ったので、

実装してみたものの解説
実際のdiffコマンドの実装がどうなっているか読んで比べてみる

について書いてみます。

実装してみたものの解説

わりと手抜きなので、オプションなしで2ファイルの全比較のみしかできません。
また、実際のdiffコマンドのコードはこの時点では見ずに想像で書いているので実際のものとは実装が結構違うかもしれません（この後でそれを検証します）

コード

こちらです。
コメントで多めに解説も書いています。

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// 引数確認
	if len(os.Args) != 3 {
		fmt.Println("引数は2つ必要です")
		os.Exit(1)
	}

	// 変換処理
	// [0]にはコマンド名が入る
	os.Exit(diff(os.Args[1], os.Args[2]))
}

// diff 2つのファイルの差分を調べる
func diff(target1, target2 string) int {
	// 引数のディレクトリ/ファイルの存在チェック
	if _, err := os.Stat(target1); err != nil {
		fmt.Println("1つ目の指定ファイルが見つかりません")
		// 異常終了の終了コード
		return 1
	}
	if _, err := os.Stat(target2); err != nil {
		fmt.Println("2つ目の指定ファイルが見つかりません")
		return 1
	}

	// ファイル開く
	target1data, err := os.Open(target1)
	if err != nil {
		fmt.Println("1つ目の指定ファイルを開くのに失敗しました")
		return 1
	}
	// 処理後確実にファイルを閉じる為にdeferでCloseさせる
	defer target1data.Close()
	// fmtでもファイル読めるがbufioの方が早い
	scanner1 := bufio.NewScanner(target1data)

	target2data, err := os.Open(target2)
	if err != nil {
		fmt.Println("2つ目の指定ファイルを開くのに失敗しました")
		return 1
	}
	defer target2data.Close()
	scanner2 := bufio.NewScanner(target2data)

	// 各ファイルのdiffを格納しておくmap
	diffLineMap1 := make(map[int]string)
	diffLineMap2 := make(map[int]string)

	line := 0
	// 差分を測るためのループ
	// Scanで1行ずつスキャンする（レスポンスはbool）
	for scanner1.Scan() {
		line++

		// Textでスキャンした行の文字列を取得する
		str1 := scanner1.Text()
		var str2 string
		if isReadFile2 := scanner2.Scan(); isReadFile2 {
			str2 = scanner2.Text()
		} else {
			// str2の方が行数が短くて取得できなかった場合
			diffLineMap1[line] = str1
			continue
		}

		// どちらも文字列をとれたら比較する
		if str1 != str2 {
			diffLineMap1[line] = str1
			diffLineMap2[line] = str2
		}
	}
	// 2つ目の方が行数が長い場合もある。1のループ終わってて2のスキャンがまだできればこちらの方が長い
	for scanner2.Scan() {
		line++
		str2 := scanner2.Text()
		// 1より2の方が行数が長い場合
		diffLineMap2[line] = str2
	}

	// 差分を出力
	for i := 0; i <= line; i++ {
		// mapの中身の存在チェック（2つ目のレスポンスで値があるかのboolが返る）
		v1, ok1 := diffLineMap1[i]
		v2, ok2 := diffLineMap2[i]
		if ok1 && ok2 {
			// どちらにも差分のある行
			fmt.Println(fmt.Sprintf("%d:%d", i, i))
			fmt.Println("< " + v1)
			fmt.Println("---")
			fmt.Println("> " + v2)
		} else if ok1 {
			// 1の方が行数が長い場合
			fmt.Println(i)
			fmt.Println("< " + v1)
		} else if ok2 {
			// 2の方が行数が長い場合
			fmt.Println(i)
			fmt.Println("> " + v2)
		}
	}
	return 0
}

テストデータとして以下を用意しました。

sample1.txt

aaaaa
bbbbb
ccccc
ddddd
abcdefa
abcdef
as
as
as

sample2.txt

aaaaaa
bbbbb
cccccc
ddddc
abcdef
abcdef
abcdef

実行結果

実行してみます。

[nishiyama:go-practice-diff]$ go run main.go sample1.txt sample2.txt
1:1
< aaaaa
---
> aaaaaa
3:3
< ccccc
---
> cccccc
4:4
< ddddd
---
> ddddc
5:5
< abcdefa
---
> abcdef
7:7
< as
---
> abcdef
8
< as
9
< as
[nishiyama:go-practice-diff]$

ちゃんと2ファイルのdiffを出せています！やったね

実際のdiffコマンドと比較してみます。

[nishiyama:go-practice-diff]$ diff sample1.txt sample2.txt          
1c1
< aaaaa
---
> aaaaaa
3,5c3,5
< ccccc
< ddddd
< abcdefa
---
> cccccc
> ddddc
> abcdef
7,9c7
< as
< as
< as
---
> abcdef
[nishiyama:go-practice-diff]$

ちょっと出力の仕方が違いますね。
diffコマンドの方は連続する行で差分があった際にはそこをまとめて出力してくれていて親切です。
僕はそこを実装するのめんどくさくて手を抜きました。
あと試してないですが、大きいファイルだと実行に失敗する気がしてます。
まぁ一応差分は取れたので満足です。

解説は以上です。
次は実際のdiffコマンドの実装を見てみましょう。

実際のdiffコマンドの実装がどうなっていたか答え合わせ

僕の環境(macOS Monterey 12.2.1, Apple M1 Max)はGNUのdiffの2.8.1だったので同じバージョンをダウンロードしました。

[nishiyama:go-practice-diff]$ diff -v
diff (GNU diffutils) 2.8.1
Copyright (C) 2002 Free Software Foundation, Inc.

This program comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of this program
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

Written by Paul Eggert, Mike Haertel, David Hayes,
Richard Stallman, and Len Tower.
[nishiyama:go-practice-diff]$

ソースコードは以下のリンクからダウンロードできます。

https://www.gnu.org/software/diffutils/
https://ftp.gnu.org/gnu/diffutils/

早速漁ってみましょう。

※この先読み違いなどあるかもしれないです。もしあった場合ご指摘いただけたら嬉しいです🙏

おそらくdiffutils-2.8.1/src/diff.cがそうでしょう。
C言語は書けないし読めないので雰囲気でいきます。
オプションは使ってないので出来るだけ比較ロジックのところだけ見ていきます。
以下、わかりにくくなるのでソースコード部分は「// ...中略」を除いて書いてあるままです。

コード追ってみます

compare_filesの呼び出し

どうやらここっぽいですね

int main(int argc, char **argv) {
  
  // ...中略

  if (from_file) {
    if (to_file)
      fatal("--from-file and --to-file both specified");
    else
      for (; optind < argc; optind++) {
        int status =
            compare_files((struct comparison *)0, from_file, argv[optind]);
        if (exit_status < status)
          exit_status = status;
      }
  } else {
    if (to_file)
      for (; optind < argc; optind++) {
        int status =
            compare_files((struct comparison *)0, argv[optind], to_file);
        if (exit_status < status)
          exit_status = status;
      }
    else {
      if (argc - optind != 2) {
        if (argc - optind < 2)
          try_help("missing operand after `%s'", argv[argc - 1]);
        else
          try_help("extra operand `%s'", argv[optind + 2]);
      }

      exit_status =
          compare_files((struct comparison *)0, argv[optind], argv[optind + 1]);
    }
  }

  /* Print any messages that were saved up for last.  */
  print_message_queue();

  check_stdout();
  exit(exit_status);
  return exit_status;
}

compare_filesの定義

compare_files()という関数にファイルを渡して差分をチェックしてるみたいです。
もうちょっと追ってみましょう。

/* Compare two files (or dirs) with parent comparison PARENT
   and names NAME0 and NAME1.
   (If PARENT is 0, then the first name is just NAME0, etc.)
   This is self-contained; it opens the files and closes them.

   Value is EXIT_SUCCESS if files are the same, EXIT_FAILURE if
   different, EXIT_TROUBLE if there is a problem opening them.  */

static int compare_files(struct comparison const *parent, char const *name0,
                         char const *name1) {

2, 3個目の引数にchar型でファイル名を受け取っていますね。
1つ目の引数で受け取っているcomparisonという型の構造体はファイルを再帰的に比較するための型（ユーザー定義型）のようです。

/* Data on two input files being compared.  */

struct comparison
  {
    struct file_data file[2];
    struct comparison const *parent;  /* parent, if a recursive comparison */
  };

読み進めます。

static int compare_files(struct comparison const *parent, char const *name0,
                         char const *name1) {
  struct comparison cmp;

  // ...中略

  if (!parent) {
    free0 = 0;
    free1 = 0;
    cmp.file[0].name = name0;
    cmp.file[1].name = name1;
  } else {
    cmp.file[0].name = free0 = dir_file_pathname(parent->file[0].name, name0);
    cmp.file[1].name = free1 = dir_file_pathname(parent->file[1].name, name1);
  }

  // ...中略

    }
  else
    {
      /* Compare the files, if no error was found.  */

      if (status == EXIT_SUCCESS)
	status = diff_2_files (&cmp);

struct comparison cmp;で比較対象を入れておく為の構造体を定義し、cmp.file[0].name = name0;でファイル名を入れています。
色々割愛しますがその後ファイルが比較できる状態かどうかを調べcmpの中に情報を入れたりしています。

そしてファイルが比較できる状態であれば、diff_2_files (&cmp);で比較を実行します。

diff_2_filesの定義

Now do the main comparison algorithm, considering just the
undiscarded lines.

と書いてあるところがあるのでそろそろdiffのロジックのメインの箇所に辿り着きそうです。

/* Report the differences of two files.  */
int
diff_2_files (struct comparison *cmp)
{
    // ...中略

      /* Now do the main comparison algorithm, considering just the
	 undiscarded lines.  */

      xvec = cmp->file[0].undiscarded;
      yvec = cmp->file[1].undiscarded;
      diags = (cmp->file[0].nondiscarded_lines
	       + cmp->file[1].nondiscarded_lines + 3);
      fdiag = xmalloc (diags * (2 * sizeof *fdiag));
      bdiag = fdiag + diags;
      fdiag += cmp->file[1].nondiscarded_lines + 1;
      bdiag += cmp->file[1].nondiscarded_lines + 1;

      /* Set TOO_EXPENSIVE to be approximate square root of input size,
	 bounded below by 256.  */
      too_expensive = 1;
      for (;  diags != 0;  diags >>= 2)
	too_expensive <<= 1;
      too_expensive = MAX (256, too_expensive);

      files[0] = cmp->file[0];
      files[1] = cmp->file[1];

      compareseq (0, cmp->file[0].nondiscarded_lines,
		  0, cmp->file[1].nondiscarded_lines, minimal);

      free (fdiag - (cmp->file[1].nondiscarded_lines + 1));

      /* Modify the results slightly to make them prettier
	 in cases where that can validly be done.  */

      shift_boundaries (cmp->file);

      /* Get the results of comparison in the form of a chain
	 of `struct change's -- an edit script.  */

      if (output_style == OUTPUT_ED)
	script = build_reverse_script (cmp->file);
      else
	script = build_script (cmp->file);

    // ...中略

    print_normal_script (script);

    // ...中略

ここはメインロジックの呼び出しから結果の出力までのコードを抽出しました。

compareseq (0, cmp->file[0].nondiscarded_lines,
    0, cmp->file[1].nondiscarded_lines, minimal);

の箇所で2ファイルの差分を比較しているようです。
その後の

shift_boundaries (cmp->file);

の差分の内容を行で結合する処理をしているようです。（僕がめんどくさがってやらなかったこと）

そして

script = build_script (cmp->file);

で差分の内容を出力用に整形して

print_normal_script (script);

で出力しているようです。

次はcompareseq()の中身を見ていきます。

compareseqの定義（diff取得コアロジック）

差分比較のコアロジックの関数です。

static void
compareseq (lin xoff, lin xlim, lin yoff, lin ylim, bool find_minimal)
{
  lin * const xv = xvec; /* Help the compiler.  */
  lin * const yv = yvec;

  /* Slide down the bottom initial diagonal. */
  while (xoff < xlim && yoff < ylim && xv[xoff] == yv[yoff])
    ++xoff, ++yoff;
  /* Slide up the top initial diagonal. */
  while (xlim > xoff && ylim > yoff && xv[xlim - 1] == yv[ylim - 1])
    --xlim, --ylim;

  /* Handle simple cases. */
  if (xoff == xlim)
    while (yoff < ylim)
      files[1].changed[files[1].realindexes[yoff++]] = 1;
  else if (yoff == ylim)
    while (xoff < xlim)
      files[0].changed[files[0].realindexes[xoff++]] = 1;
  else
    {
      lin c;
      struct partition part;

      /* Find a point of correspondence in the middle of the files.  */

      c = diag (xoff, xlim, yoff, ylim, find_minimal, &part);

      if (c == 1)
	{
	  /* This should be impossible, because it implies that
	     one of the two subsequences is empty,
	     and that case was handled above without calling `diag'.
	     Let's verify that this is true.  */
	  abort ();
#if 0
	  /* The two subsequences differ by a single insert or delete;
	     record it and we are done.  */
	  if (part.xmid - part.ymid < xoff - yoff)
	    files[1].changed[files[1].realindexes[part.ymid - 1]] = 1;
	  else
	    files[0].changed[files[0].realindexes[part.xmid]] = 1;
#endif
	}
      else
	{
	  /* Use the partitions to split this problem into subproblems.  */
	  compareseq (xoff, part.xmid, yoff, part.ymid, part.lo_minimal);
	  compareseq (part.xmid, xlim, part.ymid, ylim, part.hi_minimal);
	}
    }
}

この記事では詳細については書きませんが、このGNUのdiffコマンドにはMyersによるO(ND)アルゴリズムが使われているらしく、このコアロジック部分がそれのようです。
僕は不勉強ゆえアルゴリズムに詳しくないので、このあたりの解説は参考となりそうな先人の記事を記載することで代わりとさせてください。

差分検出アルゴリズム三種盛り
 diffの動作原理を知る～どのようにして差分を導き出すのか
 diffってなんだ

MyersによるO(ND)アルゴリズムの他には、DP(動的計画法)、WuによるO(ND)アルゴリズム、などがdiffのアルゴリズムによく使われるそうです。

深掘りは一旦ここまでにしておこうと思います。

追ったコードのまとめ

これまでの処理をまとめると

main()
- 処理実行
- exit_status = compare_files((struct comparison *)0, argv[optind], argv[optind + 1]);を呼び出す
compare_files(struct comparison const *parent, char const *name0, char const *name1)
- 引数にファイル名を受け取りintを返す
  - 第一引数のstruct comparison const *parentにより再帰的な比較が可能
- status = diff_2_files (&cmp);を呼び出す
diff_2_files (struct comparison *cmp)
- 引数に2ファイル分の情報を入れたcomparison型の参照を渡して比較しintを返す
- compareseq (0, cmp->file[0].nondiscarded_lines, 0, cmp->file[1].nondiscarded_lines, minimal);を呼び出し2ファイルの差分比較を行う
  - compareseq()を再帰的に呼び出すことでファイルの全内容を比較している
- shift_boundaries (cmp->file);により差分の内容を行ごとにまとめている
- print_normal_script (script);差分を出力

という感じになります。

自分の書いたものと比較

対して僕の書いたものは

2ファイルを開く
それぞれ1行ずつ読み取る
読み取ったものの文字列を比較
一致しなければ差分とする

程度なので、ちゃんとしたdiffを取って且つ処理速度も担保するにはMyersやWu、DP(動的計画法)などdiffを取るのに適したアルゴリズムを実装する必要が合ったでしょう。
また、出力もまとまっていた方が優しいので、shift_boundaries (cmp->file);の関数のように行ごとでまとめる工程も必要でしょう。
我ながら、これだけじゃ及第点をあげる訳にはいかないなぁという感じですね。（不勉強が露呈して恥ずかしい...）

感想

アルゴリズムって難しいなぁと思いました。
正直後半のcompareseq()あたりから書いてあることで何をしてるのかいまいち理解できてないですし、上に貼った記事を読んでもMyersによるO(ND)アルゴリズムについてあまり理解できていない気がしてます。
大きなサイズのファイルを処理するとか、行の文字ごとの差分も取得するとか、そういうところまで考えて実用レベルにしようとすると適当に実装するだけでは全然ダメですね。
アルゴリズムを使うべき場面が自分の中で少なくとも1つ明確になりました。
もっと勉強して今度は「GoでMyersのdiffを実装してみた」が書けるようにがんばります！

ここまで読んでいただいてありがとうございました！
7日目のTechCommit AdventCalendar2022の記事もお楽しみに！

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up