Lucene notes: how Analyzer works

Notes on Analyzer and Tokenizer.

An Analyzer is what provides a TokenStream. The stream is built from three kinds of components:

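CharFilter: pre-processing. A subclass of Reader, applied to the raw text before tokenization.
Tokenizer: the main processing. Splits the text into tokens.
TokenFilter: post-processing. Removes stop words, lowercases tokens, and so on.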
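
To make the three roles concrete, here is a minimal plain-Java sketch of the same flow. It does not use Lucene at all: the class `AnalysisPipelineSketch` and its methods are hypothetical names chosen for illustration, and the tag-stripping and whitespace-splitting rules are assumptions, not what any real Lucene component does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisPipelineSketch {
    // CharFilter role: rewrite the raw characters before tokenizing
    // (here: replace anything that looks like a markup tag with a space).
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // Tokenizer role: split the character stream into tokens
    // (here: split on runs of whitespace).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // TokenFilter role: post-process each token
    // (here: lowercase, then drop stop words).
    static List<String> tokenFilter(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    // The three stages chained in the order described above.
    static List<String> analyze(String text, Set<String> stopWords) {
        return tokenFilter(tokenize(charFilter(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Collections.singleton("the");
        System.out.println(analyze("<b>The</b> Quick Fox", stopWords)); // prints [quick, fox]
    }
}
```

In Lucene itself the stages are lazy streams rather than lists, but the order of responsibility is the same: characters first, then tokens, then token-level clean-up.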

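A TokenStream is obtained by calling tokenStream() on the Analyzer.
Internally, tokenStream() calls createComponents(), which is where the Tokenizer and TokenFilters are created.
To customize the analysis chain, override createComponents() as shown below (taken from the official documentation).
Note: Analyzer's tokenStream() is final, so it cannot be overridden directly.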

 Analyzer analyzer = new Analyzer() {
   @Override
   protected TokenStreamComponents createComponents(String fieldName) {
     // No Reader is in scope here: the Analyzer supplies the input
     // to the Tokenizer later, via setReader().
     Tokenizer source = new FooTokenizer();
     TokenStream filter = new FooFilter(source);
     filter = new BarFilter(filter);
     return new TokenStreamComponents(source, filter);
   }
   @Override
   protected TokenStream normalize(String fieldName, TokenStream in) {
     // Assuming FooFilter is about normalization and BarFilter is about
     // stemming, only FooFilter should be applied
     return new FooFilter(in);
   }
 };
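
The split between a final tokenStream() and an overridable createComponents() is the template method pattern. The following plain-Java sketch models only that relationship. `MiniAnalyzer` is a hypothetical stand-in, not a Lucene class, and representing the component chain as a `Function` is a simplification made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class TemplateMethodSketch {
    /** Hypothetical stand-in for Analyzer: final entry point, overridable factory. */
    abstract static class MiniAnalyzer {
        // Subclasses decide how the chain is built (plays the role of createComponents).
        protected abstract Function<String, List<String>> createComponents(String fieldName);

        // Callers always come through here (plays the role of the final tokenStream()).
        public final List<String> tokenStream(String fieldName, String text) {
            return createComponents(fieldName).apply(text);
        }
    }

    // A custom "analyzer" is defined only by overriding the factory method.
    static MiniAnalyzer lowerCaseWhitespaceAnalyzer() {
        return new MiniAnalyzer() {
            @Override
            protected Function<String, List<String>> createComponents(String fieldName) {
                return text -> {
                    List<String> out = new ArrayList<>();
                    for (String t : text.trim().split("\\s+")) {
                        if (!t.isEmpty()) {
                            out.add(t.toLowerCase(Locale.ROOT));
                        }
                    }
                    return out;
                };
            }
        };
    }

    public static void main(String[] args) {
        MiniAnalyzer analyzer = lowerCaseWhitespaceAnalyzer();
        System.out.println(analyzer.tokenStream("body", "Hello Lucene World")); // prints [hello, lucene, world]
    }
}
```

Because the entry point is final, the base class keeps control over how and when the chain is built, while a subclass only describes what the chain contains.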

The StandardAnalyzer case

  @Override
  protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(maxTokenLength);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, stopwords);
    return new TokenStreamComponents(src, tok) {
      @Override
      protected void setReader(final Reader reader) {
        src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
        super.setReader(reader);
      }
    };
  }
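
In the chain above, LowerCaseFilter is wrapped before StopFilter, so stop words are compared against tokens that have already been lowercased. The plain-Java sketch below illustrates why that order can matter. It is a model only: `FilterOrderSketch` and its methods are hypothetical names, and it assumes a stop word set that contains lowercase entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterOrderSketch {
    // Models a lowercasing filter.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Models a stop filter with exact (case-sensitive) matching.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Quick", "Fox");
        Set<String> stopWords = Collections.singleton("the");

        // Lowercase first, then stop: "The" becomes "the" and is removed.
        System.out.println(removeStopWords(lowerCase(tokens), stopWords)); // prints [quick, fox]

        // Stop first, then lowercase: "The" does not match "the" and survives.
        System.out.println(lowerCase(removeStopWords(tokens, stopWords))); // prints [the, quick, fox]
    }
}
```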

The JapaneseAnalyzer case

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new JapaneseTokenizer(userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
  }