
A look at how SimpleTextCodec output is generated in Lucene

Posted at 2022-10-23

When I first started digging into Lucene's IndexWriter, there were too many baffling terms such as "codec", so I wrote up what I found.

What is Lucene's SimpleTextCodec?

In Lucene, a Codec is the component that reads and writes index files, and SimpleTextCodec is the one whose output is easiest to read.
Each time addDocument is called, Lucene either initializes the contents of the index files or appends to them.
For example, running the code below generates index files directly under the lucene-plaintext folder, and it is SimpleTextCodec that writes them.
Note: with the default codec, Lucene94 (as of 2022-10-23), the index files are binary.

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSimpleTextCodecSample {

	private static File plaintextDir;
	private static String INDEX_ROOT_FOLDER = "/Users/coffeecup/Documents/programming/Java/lucene/lucene/lucene/core/src/java";

    private static File assureDirectoryExists(File dir) {
        if (!dir.exists()) {
            dir.mkdirs();
        }
        return dir;
    }

	public static void main(String[] args) {
		plaintextDir = assureDirectoryExists(new File(INDEX_ROOT_FOLDER, "lucene-plaintext"));
		// Use SimpleTextCodec instead of the default codec so the index is written as plain text.
		StandardAnalyzer analyzer = new StandardAnalyzer();
		IndexWriterConfig config = new IndexWriterConfig(analyzer);
		config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
		config.setCodec(new SimpleTextCodec());
		try {
			Directory luceneDir = FSDirectory.open(plaintextDir.toPath());
			IndexWriter writer = new IndexWriter(luceneDir, config);
			writer.addDocument(Arrays.asList(
					new TextField("title", "The title of my first document", Store.YES),
					new TextField("content", "The content of the first document", Store.YES)
			));
			writer.addDocument(Arrays.asList(
					new TextField("title", "The title of my second document", Store.YES),
					new TextField("content", "The content of the second document", Store.YES)
			));
			writer.close();
		} catch (IOException e) {
			System.out.println(e);
		}
	}
}

Running this program produces the following.

File written by SimpleTextCodec

cfs entry for: _b.fld
doc 0
  field 0
    name title
    type string
    value The title of my first document
  field 1
    name content
    type string
    value The content of the first document
doc 1
  field 0
    name title
    type string
    value The title of my second document
  field 1
    name content
    type string
    value The content of the second document
END
checksum 00000000003384965024
cfs entry for: _b.inf
number of fields 2
  name title
  number 0
  index options DOCS_AND_FREQS_AND_POSITIONS
  term vectors false
  payloads false
  norms true
  doc values NONE
  doc values gen -1
  attributes 0
  data dimensional count 0
  index dimensional count 0
  dimensional num bytes 0
  vector number of dimensions 0
  vector encoding FLOAT32
  vector similarity EUCLIDEAN
  soft-deletes false
  name content
  number 1
  index options DOCS_AND_FREQS_AND_POSITIONS
  term vectors false
  payloads false
  norms true
  doc values NONE
  doc values gen -1
  attributes 0
  data dimensional count 0
  index dimensional count 0
  dimensional num bytes 0
  vector number of dimensions 0
  vector encoding FLOAT32
  vector similarity EUCLIDEAN
  soft-deletes false
checksum 00000000002738111167
cfs entry for: _b.len
field title
  type NUMERIC
  minvalue 6
  pattern 0
0
T
0
T
field content
  type NUMERIC
  minvalue 6
  pattern 0
0
T
0
T
END
checksum 00000000003922424621
cfs entry for: _b.pst
field content
  term content
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term of
    doc 0
      freq 1
      pos 2
    doc 1
      freq 1
      pos 2
  term second
    doc 1
      freq 1
      pos 4
  term the
    doc 0
      freq 2
      pos 0
      pos 3
    doc 1
      freq 2
      pos 0
      pos 3
field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term my
    doc 0
      freq 1
      pos 3
    doc 1
      freq 1
      pos 3
  term of
    doc 0
      freq 1
      pos 2
    doc 1
      freq 1
      pos 2
  term second
    doc 1
      freq 1
      pos 4
  term the
    doc 0
      freq 1
      pos 0
    doc 1
      freq 1
      pos 0
  term title
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
END
checksum 00000000001990244897
table of contents, size: 4
  filename: _b.fld
    start: 22
    end: 408
  filename: _b.inf
    start: 430
    end: 1207
  filename: _b.len
    start: 1229
    end: 1385
  filename: _b.pst
    start: 1407
    end: 2427
table of contents begins at offset: 0000000000000002427

This is the file that gets written. Its table of contents lists four entries: _b.fld, _b.inf, _b.len, and _b.pst.
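To check what actually landed on disk, it can help to list the files in the index directory. The following is a minimal sketch using only the JDK; the class name IndexFileLister and the default directory name are assumptions made for illustration, so pass the real lucene-plaintext path as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class IndexFileLister {

    // Returns the names of the regular files directly under dir, sorted alphabetically.
    static List<String> listIndexFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; replace with the directory passed to FSDirectory.open.
        Path dir = Paths.get(args.length > 0 ? args[0] : "lucene-plaintext");
        for (String name : listIndexFiles(dir)) {
            System.out.println(name);
        }
    }
}
```

Because SimpleTextCodec writes plain text, any of the listed files can then be opened in a text editor.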

field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5

A section like the one above is the inverted index. This article looks at how the inverted-index part of this file is generated.
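To make the term / doc / freq / pos layout concrete, here is a toy inverted index written with plain JDK collections. It is only a sketch of the data structure, not Lucene's implementation: the class ToyInvertedIndex is hypothetical, and its "analysis" is just lowercasing and splitting on whitespace, which is far simpler than StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ToyInvertedIndex {

    // term -> (docID -> positions). TreeMap keeps terms and doc IDs sorted.
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Toy analysis: lowercase, split on whitespace, record each token's position.
    void addDocument(int docId, String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Positions of term in the given document, or an empty list if it does not occur.
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null) {
            return Collections.emptyList();
        }
        List<Integer> found = docs.get(docId);
        if (found == null) {
            return Collections.emptyList();
        }
        return found;
    }

    // Renders the postings in a layout modeled on the listing above.
    String dump(String field) {
        StringBuilder sb = new StringBuilder("field " + field + "\n");
        for (Map.Entry<String, Map<Integer, List<Integer>>> term : postings.entrySet()) {
            sb.append("  term ").append(term.getKey()).append('\n');
            for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                sb.append("    doc ").append(doc.getKey()).append('\n');
                sb.append("      freq ").append(doc.getValue().size()).append('\n');
                for (int pos : doc.getValue()) {
                    sb.append("      pos ").append(pos).append('\n');
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "The content of the first document");
        index.addDocument(1, "The content of the second document");
        System.out.print(index.dump("content"));
    }
}
```

The frequency is simply the number of recorded positions, which is why "the" ends up with two positions in each document.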

writer.addDocument

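writer.addDocument is what writes to this index file. As articles such as this one show, the implementation starts in IndexWriter.java and continues in IndexingChain.java. As the latter part of that article explains, the underlying data is a byte[] holding the bytes of each term string, plus the corresponding textStart[termID] array.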
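The idea of a shared byte[] plus a textStart[termID] array can be sketched in a few lines. This is a simplified model under my own assumptions, not Lucene's code: ToyTermPool is a hypothetical class, it uses a one-byte length prefix, and it finds existing terms with a HashMap instead of hashing the bytes directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ToyTermPool {

    private byte[] pool = new byte[16];      // all term bytes, stored back to back
    private int used = 0;                    // next free offset in pool
    private int[] textStart = new int[4];    // textStart[termID] = offset of that term in pool
    private int termCount = 0;
    private final Map<String, Integer> termIds = new HashMap<>();

    // Returns the termID for term, appending its bytes to the pool on first sight.
    int add(String term) {
        Integer existing = termIds.get(term);
        if (existing != null) {
            return existing;
        }
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("toy pool uses a 1-byte length prefix");
        }
        while (used + 1 + utf8.length > pool.length) {
            pool = Arrays.copyOf(pool, pool.length * 2);
        }
        if (termCount == textStart.length) {
            textStart = Arrays.copyOf(textStart, textStart.length * 2);
        }
        int termId = termCount++;
        textStart[termId] = used;
        pool[used++] = (byte) utf8.length;   // length prefix
        System.arraycopy(utf8, 0, pool, used, utf8.length);
        used += utf8.length;
        termIds.put(term, termId);
        return termId;
    }

    // Reads the term back from the pool using only its termID.
    String get(int termId) {
        if (termId < 0 || termId >= termCount) {
            throw new IndexOutOfBoundsException("unknown termID " + termId);
        }
        int start = textStart[termId];
        int length = pool[start];
        return new String(pool, start + 1, length, StandardCharsets.UTF_8);
    }

    int textStart(int termId) {
        return textStart[termId];
    }

    public static void main(String[] args) {
        ToyTermPool terms = new ToyTermPool();
        int the = terms.add("the");
        int title = terms.add("title");
        System.out.println(title + " starts at " + terms.textStart(title) + ": " + terms.get(title));
        System.out.println("re-adding gives the same ID: " + (terms.add("the") == the));
    }
}
```

Storing every term in one growing array means a term is identified by a small integer, and its text is recovered from a single offset lookup.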

Now let's actually read the code.
First, processDocument in IndexingChain.java is the function that DocumentsWriterPerThread hands off to. processDocument begins by calling startStoredFields, and the storedFieldsConsumer inside startStoredFields is what writes to the file.

IndexingChain.java, lines 500 to 507

  private void startStoredFields(int docID) throws IOException {
    try {
      storedFieldsConsumer.startDocument(docID);
    } catch (Throwable th) {
      onAbortingException(th);
      throw th;
    }
  }

startDocument in storedFieldsConsumer initializes the codec's writer and then calls writer.startDocument();. Looking at how IndexingChain initializes this storedFieldsConsumer, we find:

      storedFieldsConsumer =
          new StoredFieldsConsumer(indexWriterConfig.getCodec(), directory, segmentInfo);

Because indexWriterConfig.getCodec() returns the codec that was set with config.setCodec(new SimpleTextCodec());, SimpleTextCodec is used here instead of the default Lucene94.
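The two points above, that the writer is created from whichever codec was configured and that it is initialized on the first startDocument call, can be modeled with a small sketch. Everything here (ToyCodec, ToyStoredFieldsWriter, ToyStoredFieldsConsumer) is a hypothetical stand-in written for illustration and does not reproduce Lucene's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.List;

class ToyCodecSelection {

    // Hypothetical stand-in for a codec's stored fields writer.
    interface ToyStoredFieldsWriter {
        void startDocument();
    }

    // Hypothetical stand-in for a codec: it knows how to create its own writer.
    interface ToyCodec {
        ToyStoredFieldsWriter newStoredFieldsWriter(List<String> out);
    }

    // A plain-text toy codec whose writer emits one "doc N" line per document.
    static final ToyCodec PLAIN_TEXT = out -> new ToyStoredFieldsWriter() {
        private int numDocsWritten = 0;

        @Override
        public void startDocument() {
            out.add("doc " + numDocsWritten);
            numDocsWritten++;
        }
    };

    // The consumer only holds the codec; the writer is created lazily on first use.
    static final class ToyStoredFieldsConsumer {
        private final ToyCodec codec;
        private final List<String> out;
        private ToyStoredFieldsWriter writer;
        int writersCreated = 0;

        ToyStoredFieldsConsumer(ToyCodec codec, List<String> out) {
            this.codec = codec;
            this.out = out;
        }

        // docId is ignored in this toy; the writer keeps its own document counter.
        void startDocument(int docId) {
            if (writer == null) {
                writer = codec.newStoredFieldsWriter(out);
                writersCreated++;
            }
            writer.startDocument();
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        ToyStoredFieldsConsumer consumer = new ToyStoredFieldsConsumer(PLAIN_TEXT, out);
        consumer.startDocument(0);
        consumer.startDocument(1);
        out.forEach(System.out::println);
    }
}
```

Swapping PLAIN_TEXT for another ToyCodec changes what gets written without touching the consumer, which is the same reason a single setCodec call is enough to switch the output format.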

The stored fields writer for SimpleTextCodec lives in SimpleTextStoredFieldsWriter.java, and its startDocument is the writer.startDocument(); called inside storedFieldsConsumer.startDocument(docID).

  @Override
  public void startDocument() throws IOException {
    write(DOC);
    write(Integer.toString(numDocsWritten));
    newLine();

    numDocsWritten++;
  }

In the same way, following storedFieldsConsumer through processField leads to startDocument, writeField, and finishDocument.
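As a rough picture of how those three calls add up to the stored fields section shown earlier, here is a toy writer that appends to a StringBuilder. It is a sketch modeled on the output format only: ToyPlainTextStoredFieldsWriter and its method signatures are assumptions for illustration, and it handles string values only.

```java
class ToyPlainTextStoredFieldsWriter {

    private final StringBuilder out = new StringBuilder();
    private int numDocsWritten = 0;

    // Writes the "doc N" header, mirroring the startDocument shown above.
    void startDocument() {
        out.append("doc ").append(numDocsWritten).append('\n');
        numDocsWritten++;
    }

    // Writes one stored field; fieldNumber plays the role of the field's number in the .inf listing.
    void writeField(int fieldNumber, String name, String value) {
        out.append("  field ").append(fieldNumber).append('\n');
        out.append("    name ").append(name).append('\n');
        out.append("    type string\n");
        out.append("    value ").append(value).append('\n');
    }

    // Nothing to flush per document in this toy.
    void finishDocument() {
    }

    // Marks the end of the stored fields section.
    void finish() {
        out.append("END\n");
    }

    String contents() {
        return out.toString();
    }

    public static void main(String[] args) {
        ToyPlainTextStoredFieldsWriter writer = new ToyPlainTextStoredFieldsWriter();
        writer.startDocument();
        writer.writeField(0, "title", "The title of my first document");
        writer.writeField(1, "content", "The content of the first document");
        writer.finishDocument();
        writer.finish();
        System.out.print(writer.contents());
    }
}
```

The real writer also emits a checksum and supports other value types, both of which this sketch leaves out.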

The actual implementation is here.

The code is simple once you read it, so I encourage you to take a look.
