More than 5 years have passed since last update.

The Analyzer, Tokenizer, Token Filters, and Char Filters of Elasticsearch's ICU Analysis plugin


Tokenizer

  • ICU Tokenizer: behaves much like the standard tokenizer, but takes a dictionary-based approach and therefore also supports some Asian languages.

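Let's try it. Judging from the result below, the segmentation looks reasonable. Because it is dictionary-based, though, it probably cannot handle newly coined words.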

# Query
GET _analyze
{
  "tokenizer" : "icu_tokenizer",
  "filter" : [],
  "char_filter" : [],
  "text" : "わたしは猫である"
}

# Result
{
  "tokens": [
    {
      "token": "わたし",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "は",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "猫",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "で",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "ある",
      "start_offset": 6,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    }
  ]
}
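
For comparison, the same sentence can be sent through the built-in standard tokenizer. This request is a sketch added for illustration and was not part of the original experiment; no output is shown because it was not run here.

```
# Query (comparison with the standard tokenizer)
GET _analyze
{
  "tokenizer" : "standard",
  "text" : "わたしは猫である"
}
```

If the standard tokenizer returns shorter fragments than わたし or ある, the difference comes from the dictionary that the ICU tokenizer uses.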

Token Filter

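  • ICU Normalization Token Filter: the token filter counterpart of the ICU Normalization Character Filter.
  • ICU Folding Token Filter: converts Latin characters outside Basic Latin, such as ù and ᴁ, to Basic Latin. It also removes Japanese voiced sound marks (dakuten).
  • ICU Collation Token Filter: sets the collation order used for sorting.
  • ICU Transform Token Filter: can, for example, convert hiragana to romaji.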
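
The folding and transform filters above can be tried with the same `_analyze` API. The requests below are a minimal sketch, not taken from the original post: the sample texts are made up, the `keyword` tokenizer is used in the first request only to keep the input as a single token, and the transform id `Any-Latin` is an assumption about which ICU transliterator fits this use case. Both requests assume the ICU Analysis plugin is installed.

```
# Query: icu_folding on accented Latin letters and kana with voiced sound marks
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["icu_folding"],
  "text" : "ù ᴁ がぎぐ"
}

# Query: icu_transform defined inline (the id "Any-Latin" is an assumption)
GET _analyze
{
  "tokenizer" : "icu_tokenizer",
  "filter" : [
    { "type" : "icu_transform", "id" : "Any-Latin" }
  ],
  "text" : "わたしはねこである"
}
```

Defining the filter inline keeps the experiment free of index settings. For real use, the same `type` and `id` would go under `analysis.filter` in the index settings.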

Character Filter

  • ICU Normalization Character Filter
    • The available normalizer types are:
      • nfkc_cf (default)
      • nfc (Normalization Form C)
      • nfkc (Normalization Form KC)

