More than 5 years have passed since last update.

JavaでBoyer-Mooreの実装

Last updated at 2019-12-01Posted at 2019-12-01

アルゴリズムの勉強、Javaの文法を勉強するために、文字列検索アルゴリズムの一つであるBoyer-Moore法を実装しました！！

Boyer-Moore法とは？

ボイヤー・ムーア法（１９７６年）
R.S.Boyer と J.S.Moore が提案
単純検索アルゴリズムよりも高速なアルゴリズム
文字の比較結果によってずらす位置を変える
時間計算量は単純検索アルゴリズム以下

他にも文字列検索のアルゴリズムは、単純検索法とかラビンカープ法等色々あります。

Skip表

検索する前にスキップ表というものを作ります。

検索パターン：ighi

だったとして、

パターン	i	g	h	i
ずらし幅	3	2	1	0

となりますが、重複する文字列があった場合、小さい方を優先します。

したがって、

パターン	g	h	i	デフォルト
ずらし幅	2	1	0	4

となります。

デフォルトは、文字比較が重複しないところまで、Skipするために、設けています。
パターンを自然数でカウントした時の値です。

検索例

1回目

検索対象	a	b	c	d	e	f	g	i	g	h	i
パターン	i	g	h	i

検索は先頭からはじめ、検索対象とパターンの末尾を比較します。
検索対象の文字列と検索対象の末尾は、「d」と「i」なので、デフォルトの「４」つ分Skipします。

2回目

検索対象	a	b	c	d	e	i	g	i	g	h	i
パターン					i	g	h	i

「i」と「i」で、Skip表は「0」なので、隣の文字を比較します。
「g」と「h」なので、異なっています。
Skip表を見ると、「g」の時、Skipは「1」です。
１個分だけSkipします。

3回目

検索対象	a	b	c	d	e	i	g	i	g	h	i
パターン						i	g	h	i

「g」と「i」で、Skip表は「2」なので、2個分Skipします。

4回目

検索対象	a	b	c	d	e	i	g	i	g	h	i
パターン								i	g	h	i

「i」と「i」で、Skip表は「0」なので、隣の文字列を比較します。
「h」と「h」で一緒なので、隣の文字列を比較します。
「g」と「g」で一緒なので、隣の文字列を比較します。
「i」と「i」で一緒なので、検索終了です。

イメージを掴めたところで、ここからコードで実装していきたいと思います。

変数定義

public final static int bmTableSize = 1_114_111;
public final static String txt = "あいうえおっぱお"; //検索対象
public final static char[] txtChar = txt.toCharArray(); //char型にスプリット
public final static String pattern = "おっぱお"; //パターン
public final static char[] patternChar = pattern.toCharArray();

ASCIIだけしか検索出来ないのは、少しつまらなかったので、日本語にも対応しようかと思います。

JavaのリファレンスCharacter型を見ると、

Java SE APIドキュメンテーションでは、U+0000 - U+10FFFFの範囲の文字値にUnicodeコード・ポイントを使用し、UTF-16エンコーディングのコード単位である16ビットchar値にUnicodeコード単位を使用します。Unicode用語の詳細は、「Unicode Glossary」を参照してください。

文字集合:Unicode
であり、
符号化方式:UTF-16
であることがわかります

なので、10FFFFを10進数に変換、テーブルサイズ(bmTableSize)を1,114,111にしました。

テーブル作成

BoyerMooreTableInit

public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
        int count = 0;

        /* パターンに無い文字はパターン文字列長をずらし幅にする */ ・・・①
        for(count = 0; count < bmTableSize; count++){
            table[count] = ptnLen;
        }

        /* パターンに含まれる文字のずらし幅を決定する */　・・・②
        for(count = 0; count < ptnLen; count++){
            table[(int)patternChar[count]] = ptnLen - count - 1;
        }

        /* デバッグ出力 */
        System.out.printf("[table]  : default: step=%d\n", ptnLen);
        for(count = 0; count < bmTableSize; count++){
            if(table[count] != ptnLen)
                System.out.printf("         : char=%c: table[%03d]: step=%d\n",
                                 (char)count,count,(int)table[count]);
        }
   return table;
}

Skip表ですが、

パターン	お	っ	ぱ	お
ずらし幅	3	2	1	0

重複する文字があった場合、小さい方を優先するため

パターン	っ	ぱ	お	デフォルト
ずらし幅	2	1	0	4

最終的にこうなります。

①に関して、全てテーブルの中に、デフォルトの「4」を入れています。

②に関して、table[(int)patternChar[count]] = ptnLen - count - 1;は、

patternChar[count]には、[お,ぱ,っ,お]が入ってます。
これをint型にキャストしてあげると、UTF-16を10進数に変換した値が返ってきます。

「っ」　・・・　table[12387] = 2
「ぱ」　・・・　table[12401] = 1
「お」　・・・　table[12362] = 0
「デフォルト」　・・・　table[上の３つ以外の場所] = 4

検索実施

BoyerMooreSearch

public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
        int table[] = new int[bmTableSize];
        int txtLen = 0;
        int ptnLen = 0;
        int i = 0; /* テキストの比較位置 */
        int j = 0; /* パターンの比較位置 */

        txtLen = txtChar.length;
        ptnLen = patternChar.length;

        /* ずらし表を作成する */
        table = BoyerMooreTableInit(table, patternChar, ptnLen);

        /* 比較処理 */
        i = j = ptnLen - 1; /* 比較位置をパターン末尾にする */
        while((i < txtLen) && (j >= 0)){
            PrintCompareProcess(txt, pattern, i, j);

            if(txtChar[i] != patternChar[j]){
                /* ずらし表を参照して、次の比較位置を設定する */
                i += next_step(table, txtChar[i], (ptnLen - j));
                j = ptnLen - 1;   /* 比較位置をパターン末尾にする */
            }else{
                /* 文字が一致したので、前の文字を照合していく */
                j--;
                i--;
            }
        }

        if(j < 0) {
            return i + 2;
        }else {
            return 0;
        }
    }
}

・１回目比較

検索対象	あ	い	う	え	ぱ	っ	お	っ	ぱ	お
パターン	お	っ	ぱ	お

末尾「え」と「お」が異なっており、かつパターンにも存在しない文字列なので、
４文字スキップします。

・２回目比較

検索対象	あ	い	う	え	ぱ	っ	お	っ	ぱ	お
パターン					お	っ	ぱ	お

末尾がパターンの中に存在する文字で、「っ」＝ 2なので、
2文字文スキップします。

・3回目比較

検索対象	あ	い	う	え	ぱ	っ	お	っ	ぱ	お
パターン							お	っ	ぱ	お

末尾がパターンの中に存在する文字で、「ぱ」＝ 1なので、
1文字文スキップします。

・４回目比較

検索対象	あ	い	う	え	ぱ	っ	お	っ	ぱ	お
パターン							お	っ	ぱ	お

末尾が「お」 = 0　なので、隣の文字を検索します。

・・・以下繰り返し

最終的に、こんな感じです。

コード

Main.java

import java.io.UnsupportedEncodingException;

public class Main {
	public final static int bmTableSize = 1_114_111;
	public final static String txt = "あいうえぱっおっぱお";
	public final static char[] txtChar = txt.toCharArray();
	public final static String pattern = "おっぱお";
	public final static char[] patternChar = pattern.toCharArray();

	public static void main(String[] args) throws UnsupportedEncodingException {
		int result;

		System.out.printf("[text]   :%s\n", txt);
		System.out.printf("[pattern]:%s\n", pattern);

		result = BoyerMooreSearch(txtChar, patternChar);

		if (0 < result) {
			System.out.println(result + "番目にあったよ！！");
		}else {
			System.out.println("見つからなかったよ");
		}

	}

	public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
		int count = 0;

	    /* パターンに無い文字はパターン文字列長をずらし幅にする */
	    for(count = 0; count < bmTableSize; count++){
	    	table[count] = ptnLen;
	    }

	    /* パターンに含まれる文字のずらし幅を決定する */
	    for(count = 0; count < ptnLen; count++){
	    	table[(int)patternChar[count]] = ptnLen - count - 1;
	    }

	    /* デバッグ出力 */
	    System.out.printf("[table]  : default: step=%d\n", ptnLen);
	    for(count = 0; count < bmTableSize; count++){
	        if(table[count] != ptnLen)
	        	System.out.printf("         : char=%c: table[%03d]: step=%d\n",
	        					 (char)count,count,(int)table[count]);
	    }

	    return table;
	}

	public static void PrintCompareProcess(String txt, String pattern, int i, int j) {
	    int count = 0;

	    System.out.printf("-----------------------------------\n");
	    System.out.printf("[compare]:(text i=%d)(pattern j=%d)\n", i+1, j+1);
	    System.out.printf(" text    :%s\n", txt);

	    /* パターンを比較位置で重ねる */
	    System.out.printf(" pattern :");
	    for(count = 0; count < (i - j); count++) System.out.print("　"); //全角半角でずれる。
	    System.out.printf("%s\n", pattern);

	    /* 比較点にマークする */
	    System.out.printf("         :");
	    for(count = 0; count < i; count++) System.out.printf("　");
	    System.out.printf("^\n");
	}

	public static int next_step(int[] table, char target, int remain) {
	    /* ループ防止のために確認する */
	    if(table[(int)target] > remain){
	        /* ずらし表から値を取得する */
	        return(table[(int)target]);
	    }else{
	        /* 照合を開始した地点の次の文字に進める */
	        return(remain);
	    }
	}

	public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
		int table[] = new int[bmTableSize];
		int txtLen = 0;
		int ptnLen = 0;
		int i = 0; /* テキストの比較位置 */
		int j = 0; /* パターンの比較位置 */

		txtLen = txtChar.length;
		ptnLen = patternChar.length;

	    /* ずらし表を作成する */
		table = BoyerMooreTableInit(table, patternChar, ptnLen);

	    /* 比較処理 */
	    i = j = ptnLen - 1; /* 比較位置をパターン末尾にする */
	    while((i < txtLen) && (j >= 0)){
	    	PrintCompareProcess(txt, pattern, i, j);

	        if(txtChar[i] != patternChar[j]){
	            /* ずらし表を参照して、次の比較位置を設定する */
	            i += next_step(table, txtChar[i], (ptnLen - j));
	            j = ptnLen - 1;   /* 比較位置をパターン末尾にする */
	        }else{
	            /* 文字が一致したので、前の文字を照合していく */
	            j--;
	            i--;
	        }
	    }

	    if(j < 0) {
	    	return i + 2;
	    }else {
		    return 0;
	    }
	}
}

出力結果

[text]   :あいうえぱっおっぱお
[pattern]:おっぱお
[table]  : default: step=4
         : char=お: table[12362]: step=0
         : char=っ: table[12387]: step=2
         : char=ぱ: table[12401]: step=1
-----------------------------------
[compare]:(text i=4)(pattern j=4)
 text    :あいうえぱっおっぱお
 pattern :おっぱお
         :　　　^
-----------------------------------
[compare]:(text i=8)(pattern j=4)
 text    :あいうえぱっおっぱお
 pattern :　　　　おっぱお
         :　　　　　　　^
-----------------------------------
[compare]:(text i=10)(pattern j=4)
 text    :あいうえぱっおっぱお
 pattern :　　　　　　おっぱお
         :　　　　　　　　　^
-----------------------------------
[compare]:(text i=9)(pattern j=3)
 text    :あいうえぱっおっぱお
 pattern :　　　　　　おっぱお
         :　　　　　　　　^
-----------------------------------
[compare]:(text i=8)(pattern j=2)
 text    :あいうえぱっおっぱお
 pattern :　　　　　　おっぱお
         :　　　　　　　^
-----------------------------------
[compare]:(text i=7)(pattern j=1)
 text    :あいうえぱっおっぱお
 pattern :　　　　　　おっぱお
         :　　　　　　^
7番目にあったよ！！

参考になった方がいれば幸いです。

参考文献

・Java Character型リファレンス
・C言語アルゴリズム-BM法
・Unicode対応文字コード表
・【Java】char型はint型にキャストできる

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up