
The Synonym Token Filter is more intuitive in Elasticsearch 6


Analysis changes

Synonym Token Filter
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/breaking_60_analysis_changes.html#_synonym_token_filter

In 6.0, Synonym Token Filter tokenizes synonyms with whatever tokenizer and token filters appear before it in the chain.

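A term is registered as a synonym, yet the text is not converted the way the synonym says it should be: a classic search-engine gotcha.
Depending on how the text had been split into tokens before it reached the synonym token filter, the filter behaved in ways that were (from a human point of view) unintended. That problem appears to have been fixed, so I tried it out.

Run the following in Kibana Dev Tools and compare the results on 5.x and 6.x.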

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_analyzer": { 
          "type": "custom",
          "char_filter": [
          ],
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding",
            "synonym"
          ]
        }
      },
      "filter" : {
        "synonym" : {
          "type" : "synonym",
          "synonyms" : [
            "i-pod, i pod, アイポッド, あいぽっど => ipod"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "ja_analyzer" 
        }
      }
    }
  }
}

GET my_index/_analyze 
{
  "analyzer": "ja_analyzer", 
  "text":     "I pod アイポッド あいぽっど"
}
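
Outside Kibana Dev Tools, the same `_analyze` call can be sent from a script, which makes it easier to run against a 5.x node and a 6.x node in turn. The following is only a minimal Python sketch: the base URL `http://localhost:9200` is an assumption (the article does not say where the cluster runs), and `build_analyze_request`, `token_strings` and `analyze` are hypothetical helper names made up for illustration.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text, explain=False):
    """Build (but do not send) a request for <index>/_analyze."""
    body = {"analyzer": analyzer, "text": text}
    if explain:
        body["explain"] = True
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts POST as well as GET
    )


def token_strings(response):
    """Reduce an _analyze response to just its token texts."""
    return [t["token"] for t in response["tokens"]]


def analyze(base_url, index, analyzer, text):
    """Send the request and return the token texts (needs a running node)."""
    request = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(request) as resp:
        return token_strings(json.load(resp))


if __name__ == "__main__":
    # Assumed address; change it to point at your own 5.x or 6.x node.
    print(analyze("http://localhost:9200", "my_index",
                  "ja_analyzer", "I pod アイポッド あいぽっど"))
```

Keeping the request construction separate from the network call means the body can be inspected without a running cluster.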

Up to Elasticsearch 5.x

Let's look at the result for "I pod アイポッド あいぽっど".

kuromoji_tokenizer splits あいぽっど into あい / ぽ / っ / ど, so even though あいぽっど was registered as a synonym, it was not converted to ipod.

{
  "tokens": [
    {
      "token": "ipod",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ipod",
      "start_offset": 6,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "あい",
      "start_offset": 12,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "ぽ",
      "start_offset": 14,
      "end_offset": 15,
      "type": "word",
      "position": 3
    },
    {
      "token": "っ",
      "start_offset": 15,
      "end_offset": 16,
      "type": "word",
      "position": 4
    },
    {
      "token": "ど",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 5
    }
  ]
}

From Elasticsearch 6.x

You can see that "I pod アイポッド あいぽっど" now correctly becomes "ipod ipod ipod".

{
  "tokens": [
    {
      "token": "ipod",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ipod",
      "start_offset": 6,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "ipod",
      "start_offset": 12,
      "end_offset": 17,
      "type": "SYNONYM",
      "position": 2
    }
  ]
}
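
The difference between the two results can be modelled with a toy synonym filter. This is only a sketch under stated assumptions, not how Lucene implements it: `toy_chain` is a made-up stand-in for kuromoji_tokenizer plus lowercase that knows only the split reported in this article, and the "5.x" case assumes, as I understand the old default, that the synonym rule itself was split on whitespace rather than by the analyzer chain. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]

# Toy stand-in for the analyzer chain: NOT the real kuromoji_tokenizer.
# It only reproduces the split shown in this article's _analyze output.
KNOWN_SPLITS = {"あいぽっど": ["あい", "ぽ", "っ", "ど"]}


def toy_chain(text: str) -> List[str]:
    tokens: List[str] = []
    for chunk in text.lower().split():
        tokens.extend(KNOWN_SPLITS.get(chunk, [chunk]))
    return tokens


def whitespace(text: str) -> List[str]:
    return text.split()


def parse_rule(rule: str, tokenize: Tokenizer) -> Dict[Tuple[str, ...], str]:
    """Parse 'a, b c => x' into {('a',): 'x', ('b', 'c'): 'x'}."""
    lhs, rhs = rule.split("=>")
    target = rhs.strip()
    return {tuple(tokenize(term)): target for term in lhs.split(",")}


def apply_synonyms(tokens: List[str],
                   table: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace the longest matching token sequence at each position."""
    longest = max(len(key) for key in table)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                out.append(table[key])
                i += n
                break
        else:  # no rule matched at this position
            out.append(tokens[i])
            i += 1
    return out


RULE = "i-pod, i pod, アイポッド, あいぽっど => ipod"
TEXT = "I pod アイポッド あいぽっど"

# Rule split on whitespace: あいぽっど stays one term and never matches.
print(apply_synonyms(toy_chain(TEXT), parse_rule(RULE, whitespace)))
# Rule run through the same chain as the text: the four-token form matches.
print(apply_synonyms(toy_chain(TEXT), parse_rule(RULE, toy_chain)))
```

In this model the first call leaves あい / ぽ / っ / ど untouched, while the second produces three ipod tokens, mirroring the 5.x and 6.x responses above.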

Appendix (1): Analyze UI

Viewed with the Analyze UI plugin, which works on 6.x, it looks like this.
(Screenshot: Kibana.png)

Appendix (2): Results with explain: true

GET my_index/_analyze 
{
  "explain": true,
  "analyzer": "ja_analyzer", 
  "text":     "I pod アイポッド あいぽっど"
}
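
The response to this request lists every token attribute at every stage, so it gets long. A small helper can reduce it to one token list per stage, which makes it easier to see where the 5.x and 6.x chains diverge. This is a sketch only: `tokens_per_stage` is a hypothetical name, and it assumes the custom-analyzer response shape shown in this appendix (`detail.tokenizer` and `detail.tokenfilters`).

```python
def tokens_per_stage(explain_response):
    """Return [(stage_name, [token, ...]), ...] from an explain response."""
    detail = explain_response["detail"]
    stages = [(detail["tokenizer"]["name"], detail["tokenizer"]["tokens"])]
    stages += [(f["name"], f["tokens"]) for f in detail["tokenfilters"]]
    return [(name, [t["token"] for t in tokens]) for name, tokens in stages]


if __name__ == "__main__":
    # Hand-trimmed sample in the same shape as the response, for illustration.
    sample = {
        "detail": {
            "custom_analyzer": True,
            "charfilters": [],
            "tokenizer": {
                "name": "kuromoji_tokenizer",
                "tokens": [{"token": "I"}, {"token": "pod"}],
            },
            "tokenfilters": [
                {"name": "lowercase",
                 "tokens": [{"token": "i"}, {"token": "pod"}]},
                {"name": "synonym", "tokens": [{"token": "ipod"}]},
            ],
        }
    }
    for name, tokens in tokens_per_stage(sample):
        print(name, tokens)
```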

5.x


{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "kuromoji_tokenizer",
      "tokens": [
        {
          "token": "I",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0,
          "baseForm": null,
          "bytes": "[49]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-固有名詞-組織",
          "partOfSpeech (en)": "noun-proper-organization",
          "positionLength": 1,
          "pronunciation": null,
          "pronunciation (en)": null,
          "reading": null,
          "reading (en)": null
        },
        {
          "token": "pod",
          "start_offset": 2,
          "end_offset": 5,
          "type": "word",
          "position": 1,
          "baseForm": null,
          "bytes": "[70 6f 64]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-固有名詞-組織",
          "partOfSpeech (en)": "noun-proper-organization",
          "positionLength": 1,
          "pronunciation": null,
          "pronunciation (en)": null,
          "reading": null,
          "reading (en)": null
        },
        {
          "token": "アイポッド",
          "start_offset": 6,
          "end_offset": 11,
          "type": "word",
          "position": 2,
          "baseForm": null,
          "bytes": "[e3 82 a2 e3 82 a4 e3 83 9d e3 83 83 e3 83 89]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-一般",
          "partOfSpeech (en)": "noun-common",
          "positionLength": 1,
          "pronunciation": null,
          "pronunciation (en)": null,
          "reading": null,
          "reading (en)": null
        },
        {
          "token": "あい",
          "start_offset": 12,
          "end_offset": 14,
          "type": "word",
          "position": 3,
          "baseForm": "あう",
          "bytes": "[e3 81 82 e3 81 84]",
          "inflectionForm": "連用形",
          "inflectionForm (en)": "conjunctive",
          "inflectionType": "五段・ワ行促音便",
          "inflectionType (en)": "5-row-cons-w-cons-onbin",
          "partOfSpeech": "動詞-自立",
          "partOfSpeech (en)": "verb-main",
          "positionLength": 1,
          "pronunciation": "アイ",
          "pronunciation (en)": "ai",
          "reading": "アイ",
          "reading (en)": "ai"
        },
        {
          "token": "ぽ",
          "start_offset": 14,
          "end_offset": 15,
          "type": "word",
          "position": 4,
          "baseForm": "ぽい",
          "bytes": "[e3 81 bd]",
          "inflectionForm": "ガル接続",
          "inflectionForm (en)": "garu-connection",
          "inflectionType": "形容詞・アウオ段",
          "inflectionType (en)": "adj-group-a-o-u",
          "partOfSpeech": "形容詞-接尾",
          "partOfSpeech (en)": "adjective-suffix",
          "positionLength": 1,
          "pronunciation": "ポ",
          "pronunciation (en)": "po",
          "reading": "ポ",
          "reading (en)": "po"
        },
        {
          "token": "っ",
          "start_offset": 15,
          "end_offset": 16,
          "type": "word",
          "position": 5,
          "baseForm": "く",
          "bytes": "[e3 81 a3]",
          "inflectionForm": "連用タ接続",
          "inflectionForm (en)": "conjunctive-ta-connection",
          "inflectionType": "五段・カ行促音便",
          "inflectionType (en)": "5-row-cons-k-cons-onbin",
          "partOfSpeech": "動詞-非自立",
          "partOfSpeech (en)": "verb-auxiliary",
          "positionLength": 1,
          "pronunciation": "ッ",
          "pronunciation (en)": "",
          "reading": "ッ",
          "reading (en)": ""
        },
        {
          "token": "ど",
          "start_offset": 16,
          "end_offset": 17,
          "type": "word",
          "position": 6,
          "baseForm": null,
          "bytes": "[e3 81 a9]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "助詞-接続助詞",
          "partOfSpeech (en)": "particle-conjunctive",
          "positionLength": 1,
          "pronunciation": "ド",
          "pronunciation (en)": "do",
          "reading": "ド",
          "reading (en)": "do"
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "baseForm": null,
            "bytes": "[69]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "pod",
            "start_offset": 2,
            "end_offset": 5,
            "type": "word",
            "position": 1,
            "baseForm": null,
            "bytes": "[70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "アイポッド",
            "start_offset": 6,
            "end_offset": 11,
            "type": "word",
            "position": 2,
            "baseForm": null,
            "bytes": "[e3 82 a2 e3 82 a4 e3 83 9d e3 83 83 e3 83 89]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-一般",
            "partOfSpeech (en)": "noun-common",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "あい",
            "start_offset": 12,
            "end_offset": 14,
            "type": "word",
            "position": 3,
            "baseForm": "あう",
            "bytes": "[e3 81 82 e3 81 84]",
            "inflectionForm": "連用形",
            "inflectionForm (en)": "conjunctive",
            "inflectionType": "五段・ワ行促音便",
            "inflectionType (en)": "5-row-cons-w-cons-onbin",
            "partOfSpeech": "動詞-自立",
            "partOfSpeech (en)": "verb-main",
            "positionLength": 1,
            "pronunciation": "アイ",
            "pronunciation (en)": "ai",
            "reading": "アイ",
            "reading (en)": "ai"
          },
          {
            "token": "ぽ",
            "start_offset": 14,
            "end_offset": 15,
            "type": "word",
            "position": 4,
            "baseForm": "ぽい",
            "bytes": "[e3 81 bd]",
            "inflectionForm": "ガル接続",
            "inflectionForm (en)": "garu-connection",
            "inflectionType": "形容詞・アウオ段",
            "inflectionType (en)": "adj-group-a-o-u",
            "partOfSpeech": "形容詞-接尾",
            "partOfSpeech (en)": "adjective-suffix",
            "positionLength": 1,
            "pronunciation": "ポ",
            "pronunciation (en)": "po",
            "reading": "ポ",
            "reading (en)": "po"
          },
          {
            "token": "っ",
            "start_offset": 15,
            "end_offset": 16,
            "type": "word",
            "position": 5,
            "baseForm": "く",
            "bytes": "[e3 81 a3]",
            "inflectionForm": "連用タ接続",
            "inflectionForm (en)": "conjunctive-ta-connection",
            "inflectionType": "五段・カ行促音便",
            "inflectionType (en)": "5-row-cons-k-cons-onbin",
            "partOfSpeech": "動詞-非自立",
            "partOfSpeech (en)": "verb-auxiliary",
            "positionLength": 1,
            "pronunciation": "ッ",
            "pronunciation (en)": "",
            "reading": "ッ",
            "reading (en)": ""
          },
          {
            "token": "ど",
            "start_offset": 16,
            "end_offset": 17,
            "type": "word",
            "position": 6,
            "baseForm": null,
            "bytes": "[e3 81 a9]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "助詞-接続助詞",
            "partOfSpeech (en)": "particle-conjunctive",
            "positionLength": 1,
            "pronunciation": "ド",
            "pronunciation (en)": "do",
            "reading": "ド",
            "reading (en)": "do"
          }
        ]
      },
      {
        "name": "asciifolding",
        "tokens": [
          {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "baseForm": null,
            "bytes": "[69]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "pod",
            "start_offset": 2,
            "end_offset": 5,
            "type": "word",
            "position": 1,
            "baseForm": null,
            "bytes": "[70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "アイポッド",
            "start_offset": 6,
            "end_offset": 11,
            "type": "word",
            "position": 2,
            "baseForm": null,
            "bytes": "[e3 82 a2 e3 82 a4 e3 83 9d e3 83 83 e3 83 89]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-一般",
            "partOfSpeech (en)": "noun-common",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "あい",
            "start_offset": 12,
            "end_offset": 14,
            "type": "word",
            "position": 3,
            "baseForm": "あう",
            "bytes": "[e3 81 82 e3 81 84]",
            "inflectionForm": "連用形",
            "inflectionForm (en)": "conjunctive",
            "inflectionType": "五段・ワ行促音便",
            "inflectionType (en)": "5-row-cons-w-cons-onbin",
            "partOfSpeech": "動詞-自立",
            "partOfSpeech (en)": "verb-main",
            "positionLength": 1,
            "pronunciation": "アイ",
            "pronunciation (en)": "ai",
            "reading": "アイ",
            "reading (en)": "ai"
          },
          {
            "token": "ぽ",
            "start_offset": 14,
            "end_offset": 15,
            "type": "word",
            "position": 4,
            "baseForm": "ぽい",
            "bytes": "[e3 81 bd]",
            "inflectionForm": "ガル接続",
            "inflectionForm (en)": "garu-connection",
            "inflectionType": "形容詞・アウオ段",
            "inflectionType (en)": "adj-group-a-o-u",
            "partOfSpeech": "形容詞-接尾",
            "partOfSpeech (en)": "adjective-suffix",
            "positionLength": 1,
            "pronunciation": "ポ",
            "pronunciation (en)": "po",
            "reading": "ポ",
            "reading (en)": "po"
          },
          {
            "token": "っ",
            "start_offset": 15,
            "end_offset": 16,
            "type": "word",
            "position": 5,
            "baseForm": "く",
            "bytes": "[e3 81 a3]",
            "inflectionForm": "連用タ接続",
            "inflectionForm (en)": "conjunctive-ta-connection",
            "inflectionType": "五段・カ行促音便",
            "inflectionType (en)": "5-row-cons-k-cons-onbin",
            "partOfSpeech": "動詞-非自立",
            "partOfSpeech (en)": "verb-auxiliary",
            "positionLength": 1,
            "pronunciation": "ッ",
            "pronunciation (en)": "",
            "reading": "ッ",
            "reading (en)": ""
          },
          {
            "token": "ど",
            "start_offset": 16,
            "end_offset": 17,
            "type": "word",
            "position": 6,
            "baseForm": null,
            "bytes": "[e3 81 a9]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "助詞-接続助詞",
            "partOfSpeech (en)": "particle-conjunctive",
            "positionLength": 1,
            "pronunciation": "ド",
            "pronunciation (en)": "do",
            "reading": "ド",
            "reading (en)": "do"
          }
        ]
      },
      {
        "name": "synonym",
        "tokens": [
          {
            "token": "ipod",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 0,
            "baseForm": null,
            "bytes": "[69 70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": null,
            "partOfSpeech (en)": null,
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "ipod",
            "start_offset": 6,
            "end_offset": 11,
            "type": "SYNONYM",
            "position": 1,
            "baseForm": null,
            "bytes": "[69 70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": null,
            "partOfSpeech (en)": null,
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null
          },
          {
            "token": "あい",
            "start_offset": 12,
            "end_offset": 14,
            "type": "word",
            "position": 2,
            "baseForm": "あう",
            "bytes": "[e3 81 82 e3 81 84]",
            "inflectionForm": "連用形",
            "inflectionForm (en)": "conjunctive",
            "inflectionType": "五段・ワ行促音便",
            "inflectionType (en)": "5-row-cons-w-cons-onbin",
            "partOfSpeech": "動詞-自立",
            "partOfSpeech (en)": "verb-main",
            "positionLength": 1,
            "pronunciation": "アイ",
            "pronunciation (en)": "ai",
            "reading": "アイ",
            "reading (en)": "ai"
          },
          {
            "token": "ぽ",
            "start_offset": 14,
            "end_offset": 15,
            "type": "word",
            "position": 3,
            "baseForm": "ぽい",
            "bytes": "[e3 81 bd]",
            "inflectionForm": "ガル接続",
            "inflectionForm (en)": "garu-connection",
            "inflectionType": "形容詞・アウオ段",
            "inflectionType (en)": "adj-group-a-o-u",
            "partOfSpeech": "形容詞-接尾",
            "partOfSpeech (en)": "adjective-suffix",
            "positionLength": 1,
            "pronunciation": "ポ",
            "pronunciation (en)": "po",
            "reading": "ポ",
            "reading (en)": "po"
          },
          {
            "token": "っ",
            "start_offset": 15,
            "end_offset": 16,
            "type": "word",
            "position": 4,
            "baseForm": "く",
            "bytes": "[e3 81 a3]",
            "inflectionForm": "連用タ接続",
            "inflectionForm (en)": "conjunctive-ta-connection",
            "inflectionType": "五段・カ行促音便",
            "inflectionType (en)": "5-row-cons-k-cons-onbin",
            "partOfSpeech": "動詞-非自立",
            "partOfSpeech (en)": "verb-auxiliary",
            "positionLength": 1,
            "pronunciation": "ッ",
            "pronunciation (en)": "",
            "reading": "ッ",
            "reading (en)": ""
          },
          {
            "token": "ど",
            "start_offset": 16,
            "end_offset": 17,
            "type": "word",
            "position": 5,
            "baseForm": null,
            "bytes": "[e3 81 a9]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "助詞-接続助詞",
            "partOfSpeech (en)": "particle-conjunctive",
            "positionLength": 1,
            "pronunciation": "ド",
            "pronunciation (en)": "do",
            "reading": "ド",
            "reading (en)": "do"
          }
        ]
      }
    ]
  }
}

6.x


{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "kuromoji_tokenizer",
      "tokens": [
        {
          "token": "I",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0,
          "baseForm": null,
          "bytes": "[49]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-固有名詞-組織",
          "partOfSpeech (en)": "noun-proper-organization",
          "positionLength": 1,
          "pronunciation": null,
          "pronunciation (en)": null,
          "reading": null,
          "reading (en)": null,
          "termFrequency": 1
        },
        {
          "token": "pod",
          "start_offset": 2,
          "end_offset": 5,
          "type": "word",
          "position": 1,
          "baseForm": null,
          "bytes": "[70 6f 64]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-固有名詞-組織",
          "partOfSpeech (en)": "noun-proper-organization",
          "positionLength": 1,
          "pronunciation": null,
          "pronunciation (en)": null,
          "reading": null,
          "reading (en)": null,
          "termFrequency": 1
        },
        {
          "token": "アイポッド",
          "start_offset": 6,
          "end_offset": 11,
          "type": "word",
          "position": 2,
          "baseForm": null,
          "bytes": "[e3 82 a2 e3 82 a4 e3 83 9d e3 83 83 e3 83 89]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-一般",
          "partOfSpeech (en)": "noun-common",
          "positionLength": 1,
          "pronunciation": null,
          "pronunciation (en)": null,
          "reading": null,
          "reading (en)": null,
          "termFrequency": 1
        },
        {
          "token": "あい",
          "start_offset": 12,
          "end_offset": 14,
          "type": "word",
          "position": 3,
          "baseForm": "あう",
          "bytes": "[e3 81 82 e3 81 84]",
          "inflectionForm": "連用形",
          "inflectionForm (en)": "conjunctive",
          "inflectionType": "五段・ワ行促音便",
          "inflectionType (en)": "5-row-cons-w-cons-onbin",
          "partOfSpeech": "動詞-自立",
          "partOfSpeech (en)": "verb-main",
          "positionLength": 1,
          "pronunciation": "アイ",
          "pronunciation (en)": "ai",
          "reading": "アイ",
          "reading (en)": "ai",
          "termFrequency": 1
        },
        {
          "token": "ぽ",
          "start_offset": 14,
          "end_offset": 15,
          "type": "word",
          "position": 4,
          "baseForm": "ぽい",
          "bytes": "[e3 81 bd]",
          "inflectionForm": "ガル接続",
          "inflectionForm (en)": "garu-connection",
          "inflectionType": "形容詞・アウオ段",
          "inflectionType (en)": "adj-group-a-o-u",
          "partOfSpeech": "形容詞-接尾",
          "partOfSpeech (en)": "adjective-suffix",
          "positionLength": 1,
          "pronunciation": "ポ",
          "pronunciation (en)": "po",
          "reading": "ポ",
          "reading (en)": "po",
          "termFrequency": 1
        },
        {
          "token": "っ",
          "start_offset": 15,
          "end_offset": 16,
          "type": "word",
          "position": 5,
          "baseForm": "く",
          "bytes": "[e3 81 a3]",
          "inflectionForm": "連用タ接続",
          "inflectionForm (en)": "conjunctive-ta-connection",
          "inflectionType": "五段・カ行促音便",
          "inflectionType (en)": "5-row-cons-k-cons-onbin",
          "partOfSpeech": "動詞-非自立",
          "partOfSpeech (en)": "verb-auxiliary",
          "positionLength": 1,
          "pronunciation": "ッ",
          "pronunciation (en)": "",
          "reading": "ッ",
          "reading (en)": "",
          "termFrequency": 1
        },
        {
          "token": "ど",
          "start_offset": 16,
          "end_offset": 17,
          "type": "word",
          "position": 6,
          "baseForm": null,
          "bytes": "[e3 81 a9]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "助詞-接続助詞",
          "partOfSpeech (en)": "particle-conjunctive",
          "positionLength": 1,
          "pronunciation": "ド",
          "pronunciation (en)": "do",
          "reading": "ド",
          "reading (en)": "do",
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "baseForm": null,
            "bytes": "[69]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "pod",
            "start_offset": 2,
            "end_offset": 5,
            "type": "word",
            "position": 1,
            "baseForm": null,
            "bytes": "[70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "アイポッド",
            "start_offset": 6,
            "end_offset": 11,
            "type": "word",
            "position": 2,
            "baseForm": null,
            "bytes": "[e3 82 a2 e3 82 a4 e3 83 9d e3 83 83 e3 83 89]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-一般",
            "partOfSpeech (en)": "noun-common",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "あい",
            "start_offset": 12,
            "end_offset": 14,
            "type": "word",
            "position": 3,
            "baseForm": "あう",
            "bytes": "[e3 81 82 e3 81 84]",
            "inflectionForm": "連用形",
            "inflectionForm (en)": "conjunctive",
            "inflectionType": "五段・ワ行促音便",
            "inflectionType (en)": "5-row-cons-w-cons-onbin",
            "partOfSpeech": "動詞-自立",
            "partOfSpeech (en)": "verb-main",
            "positionLength": 1,
            "pronunciation": "アイ",
            "pronunciation (en)": "ai",
            "reading": "アイ",
            "reading (en)": "ai",
            "termFrequency": 1
          },
          {
            "token": "ぽ",
            "start_offset": 14,
            "end_offset": 15,
            "type": "word",
            "position": 4,
            "baseForm": "ぽい",
            "bytes": "[e3 81 bd]",
            "inflectionForm": "ガル接続",
            "inflectionForm (en)": "garu-connection",
            "inflectionType": "形容詞・アウオ段",
            "inflectionType (en)": "adj-group-a-o-u",
            "partOfSpeech": "形容詞-接尾",
            "partOfSpeech (en)": "adjective-suffix",
            "positionLength": 1,
            "pronunciation": "ポ",
            "pronunciation (en)": "po",
            "reading": "ポ",
            "reading (en)": "po",
            "termFrequency": 1
          },
          {
            "token": "っ",
            "start_offset": 15,
            "end_offset": 16,
            "type": "word",
            "position": 5,
            "baseForm": "く",
            "bytes": "[e3 81 a3]",
            "inflectionForm": "連用タ接続",
            "inflectionForm (en)": "conjunctive-ta-connection",
            "inflectionType": "五段・カ行促音便",
            "inflectionType (en)": "5-row-cons-k-cons-onbin",
            "partOfSpeech": "動詞-非自立",
            "partOfSpeech (en)": "verb-auxiliary",
            "positionLength": 1,
            "pronunciation": "ッ",
            "pronunciation (en)": "",
            "reading": "ッ",
            "reading (en)": "",
            "termFrequency": 1
          },
          {
            "token": "ど",
            "start_offset": 16,
            "end_offset": 17,
            "type": "word",
            "position": 6,
            "baseForm": null,
            "bytes": "[e3 81 a9]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "助詞-接続助詞",
            "partOfSpeech (en)": "particle-conjunctive",
            "positionLength": 1,
            "pronunciation": "ド",
            "pronunciation (en)": "do",
            "reading": "ド",
            "reading (en)": "do",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "asciifolding",
        "tokens": [
          {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0,
            "baseForm": null,
            "bytes": "[69]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "pod",
            "start_offset": 2,
            "end_offset": 5,
            "type": "word",
            "position": 1,
            "baseForm": null,
            "bytes": "[70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-固有名詞-組織",
            "partOfSpeech (en)": "noun-proper-organization",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "アイポッド",
            "start_offset": 6,
            "end_offset": 11,
            "type": "word",
            "position": 2,
            "baseForm": null,
            "bytes": "[e3 82 a2 e3 82 a4 e3 83 9d e3 83 83 e3 83 89]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "名詞-一般",
            "partOfSpeech (en)": "noun-common",
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "あい",
            "start_offset": 12,
            "end_offset": 14,
            "type": "word",
            "position": 3,
            "baseForm": "あう",
            "bytes": "[e3 81 82 e3 81 84]",
            "inflectionForm": "連用形",
            "inflectionForm (en)": "conjunctive",
            "inflectionType": "五段・ワ行促音便",
            "inflectionType (en)": "5-row-cons-w-cons-onbin",
            "partOfSpeech": "動詞-自立",
            "partOfSpeech (en)": "verb-main",
            "positionLength": 1,
            "pronunciation": "アイ",
            "pronunciation (en)": "ai",
            "reading": "アイ",
            "reading (en)": "ai",
            "termFrequency": 1
          },
          {
            "token": "ぽ",
            "start_offset": 14,
            "end_offset": 15,
            "type": "word",
            "position": 4,
            "baseForm": "ぽい",
            "bytes": "[e3 81 bd]",
            "inflectionForm": "ガル接続",
            "inflectionForm (en)": "garu-connection",
            "inflectionType": "形容詞・アウオ段",
            "inflectionType (en)": "adj-group-a-o-u",
            "partOfSpeech": "形容詞-接尾",
            "partOfSpeech (en)": "adjective-suffix",
            "positionLength": 1,
            "pronunciation": "ポ",
            "pronunciation (en)": "po",
            "reading": "ポ",
            "reading (en)": "po",
            "termFrequency": 1
          },
          {
            "token": "っ",
            "start_offset": 15,
            "end_offset": 16,
            "type": "word",
            "position": 5,
            "baseForm": "く",
            "bytes": "[e3 81 a3]",
            "inflectionForm": "連用タ接続",
            "inflectionForm (en)": "conjunctive-ta-connection",
            "inflectionType": "五段・カ行促音便",
            "inflectionType (en)": "5-row-cons-k-cons-onbin",
            "partOfSpeech": "動詞-非自立",
            "partOfSpeech (en)": "verb-auxiliary",
            "positionLength": 1,
            "pronunciation": "ッ",
            "pronunciation (en)": "",
            "reading": "ッ",
            "reading (en)": "",
            "termFrequency": 1
          },
          {
            "token": "ど",
            "start_offset": 16,
            "end_offset": 17,
            "type": "word",
            "position": 6,
            "baseForm": null,
            "bytes": "[e3 81 a9]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": "助詞-接続助詞",
            "partOfSpeech (en)": "particle-conjunctive",
            "positionLength": 1,
            "pronunciation": "ド",
            "pronunciation (en)": "do",
            "reading": "ド",
            "reading (en)": "do",
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "synonym",
        "tokens": [
          {
            "token": "ipod",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 0,
            "baseForm": null,
            "bytes": "[69 70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": null,
            "partOfSpeech (en)": null,
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "ipod",
            "start_offset": 6,
            "end_offset": 11,
            "type": "SYNONYM",
            "position": 1,
            "baseForm": null,
            "bytes": "[69 70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": null,
            "partOfSpeech (en)": null,
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          },
          {
            "token": "ipod",
            "start_offset": 12,
            "end_offset": 17,
            "type": "SYNONYM",
            "position": 2,
            "baseForm": null,
            "bytes": "[69 70 6f 64]",
            "inflectionForm": null,
            "inflectionForm (en)": null,
            "inflectionType": null,
            "inflectionType (en)": null,
            "partOfSpeech": null,
            "partOfSpeech (en)": null,
            "positionLength": 1,
            "pronunciation": null,
            "pronunciation (en)": null,
            "reading": null,
            "reading (en)": null,
            "termFrequency": 1
          }
        ]
      }
    ]
  }
}
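
The explain output above is long, so it can help to boil it down to just the token text at each stage. The following is a minimal Python sketch and not part of the original experiment: `tokens_by_stage` is a hypothetical helper name, and it assumes the response has the `detail.tokenizer` / `detail.tokenfilters` shape of the output above. The `sample` dict is a hand-trimmed copy of the last two filter stages, not a real response.

```python
from typing import Any, Dict, List, Tuple


def tokens_by_stage(response: Dict[str, Any]) -> List[Tuple[str, List[str]]]:
    """Return (stage name, token texts) for the tokenizer and each token filter."""
    detail = response.get("detail", {})
    stages: List[Tuple[str, List[str]]] = []
    tokenizer = detail.get("tokenizer")
    if tokenizer:
        stages.append((tokenizer["name"], [t["token"] for t in tokenizer["tokens"]]))
    for token_filter in detail.get("tokenfilters", []):
        stages.append((token_filter["name"], [t["token"] for t in token_filter["tokens"]]))
    return stages


# Hand-trimmed stand-in for the response above: only the fields the helper reads.
sample = {
    "detail": {
        "tokenfilters": [
            {
                "name": "asciifolding",
                "tokens": [{"token": t} for t in ["i", "pod", "アイポッド", "あい", "ぽ", "っ", "ど"]],
            },
            {
                "name": "synonym",
                "tokens": [{"token": "ipod"}, {"token": "ipod"}, {"token": "ipod"}],
            },
        ]
    }
}

for name, tokens in tokens_by_stage(sample):
    print(name, tokens)
```

Read this way, it is easy to see that the seven tokens coming out of `asciifolding` collapse into three `ipod` tokens at the `synonym` stage.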

Solving the "デジタル一眼レフカメラ" (digital SLR camera) problem in 6.x with synonyms alone

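Reference

Solr + kuromoji で単語の切れ方がおかしかったのでガッツリ調べてみた、理由と調べ方その方法を公開します! - よしだのブログ (a blog post that digs into why Solr + kuromoji split words oddly and how to investigate it)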
http://blog.yoslab.com/entry/2014/09/12/005207

When "一眼レフカメラ" (SLR camera) is morphologically analyzed with kuromoji, the result is:

Surface form Part-of-Speech Base form Reading Pronunciation
一眼 名詞,一般,, 一眼 イチガン
レフ 名詞,一般,, レフ レフ
カメラ 名詞,一般,, カメラ カメラ

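Prepend "デジタル" (digital) to it and the result becomes:

Surface form Part-of-Speech Base form Reading Pronunciation
デジタル 名詞,一般,, デジタル デジタル
一 名詞,数,, 一 イチ
眼 名詞,一般,, 眼 メ
レフ 名詞,一般,, レフ レフ
カメラ 名詞,一般,, カメラ カメラ

The problem is that the text is now split into デジタル / 一 (イチ) / 眼 (メ) / レフ / カメラ, so the token 一眼 no longer exists.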

In 6.x this text can be made to match a search for 一眼 without using a kuromoji user dictionary.
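
Why a rule such as "デジタル一眼 => デジタル, 一眼" can match at all in 6.x can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Elasticsearch code: `TOY_SEGMENTS`, `toy_tokenize`, `whitespace_tokenize`, and `apply_rule` are hypothetical names, the segmentations are hard-coded from the kuromoji table above, and "whitespace" stands in for how 5.x parsed synonym rules by default as far as I understand it.

```python
from typing import Callable, Dict, List

# Hard-coded segmentations standing in for kuromoji_tokenizer (assumption:
# taken from the table above, not produced by a real analyzer).
TOY_SEGMENTS: Dict[str, List[str]] = {
    "デジタル一眼": ["デジタル", "一", "眼"],
    "デジタル一眼レフカメラ": ["デジタル", "一", "眼", "レフ", "カメラ"],
}


def toy_tokenize(text: str) -> List[str]:
    """Pretend to be the analysis chain that precedes the synonym filter."""
    return list(TOY_SEGMENTS.get(text, [text]))


def whitespace_tokenize(text: str) -> List[str]:
    """Split the rule on whitespace only."""
    return text.split()


def apply_rule(tokens: List[str], rule: str,
               rule_tokenizer: Callable[[str], List[str]]) -> List[str]:
    """Replace every occurrence of the rule's left side in the token stream."""
    lhs, rhs = [part.strip() for part in rule.split("=>")]
    pattern = rule_tokenizer(lhs)            # how the rule itself is tokenized
    replacement = [term.strip() for term in rhs.split(",")]
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(pattern)] == pattern:
            out.extend(replacement)
            i += len(pattern)
        else:
            out.append(tokens[i])
            i += 1
    return out


rule = "デジタル一眼 => デジタル, 一眼"
stream = toy_tokenize("デジタル一眼レフカメラ")

as_5x = apply_rule(stream, rule, whitespace_tokenize)  # rule stays one token: no match
as_6x = apply_rule(stream, rule, toy_tokenize)         # rule is split like the text: match

print(as_5x)
print(as_6x)
```

With the whitespace-parsed rule the pattern is the single token デジタル一眼, which never appears in the stream. Once the rule goes through the same tokenizer, the pattern becomes デジタル / 一 / 眼 and lines up with the stream.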

Option 1: "デジタル一眼 => デジタル, 一眼" (but the highlight position is off)

[1]
DELETE my_index
[2]
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_analyzer": {
          "type": "custom",
          "char_filter": [
          ],
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding",
            "synonym"
          ]
        }
      },
      "filter" : {
        "synonym" : {
          "type" : "synonym",
          "synonyms" : [
            "デジタル一眼 => デジタル, 一眼"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "ja_analyzer"
        }
      }
    }
  }
}
[3]
GET my_index/_analyze
{
  "analyzer": "ja_analyzer",
  "text":     "デジタル一眼レフカメラ 一眼レフカメラ"
}
[4]
POST my_index/my_type/1
{
  "my_text" : "一眼レフカメラが欲しい"
}
[5]
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "my_text": "一眼"
    }
  },
  "highlight" : {
    "fields" : {
      "my_text" : {}
    }
  }
}
[6]
PUT my_index/my_type/2
{
  "my_text" : "デジタル一眼レフカメラが欲しい"
}
[7]
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "my_text": "一眼"
    }
  },
  "highlight" : {
    "fields" : {
      "my_text" : {}
    }
  }
}

With the synonym "デジタル一眼 => デジタル, 一眼" in place, [7] returns the following.

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.30873197,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.30873197,
        "_source": {
          "my_text": "デジタル一眼レフカメラが欲しい"
        },
        "highlight": {
          "my_text": [
            "<em>デジタル一眼</em>レフカメラが欲しい"
          ]
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_text": "一眼レフカメラが欲しい"
        },
        "highlight": {
          "my_text": [
            "<em>一眼</em>レフカメラが欲しい"
          ]
        }
      }
    ]
  }
}

Both documents match, but even though the search term is "一眼", the highlighted span is "デジタル一眼".

To see why, run

GET my_index/_analyze
{
  "analyzer": "ja_analyzer",
  "text":     "デジタル一眼レフカメラ"
}

and look at the result:

{
  "tokens": [
    {
      "token": "デジタル",
      "start_offset": 0,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "一眼",
      "start_offset": 0,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "レフ",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "カメラ",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    }
  ]
}

The offsets of "一眼" are "start_offset": 0, "end_offset": 6, the span of the whole matched text "デジタル一眼", which is why the highlight position is off.
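
The effect of those offsets can be reproduced with a simplified model of highlighting. This is only a sketch: `highlight` is a hypothetical function that wraps the source span of each matching token in `<em>` tags, which is much cruder than what the real Elasticsearch highlighters do. The first token list is copied from the output above; the second uses the offsets 4 to 6 that the "一 眼 => 一眼" rule produces.

```python
from typing import Dict, List


def highlight(text: str, tokens: List[Dict], term: str) -> str:
    """Wrap the source span of every token equal to `term` in <em> tags."""
    spans = sorted({(t["start_offset"], t["end_offset"])
                    for t in tokens if t["token"] == term})
    out: List[str] = []
    cursor = 0
    for start, end in spans:
        if start < cursor:          # skip spans overlapping one already emitted
            continue
        out.append(text[cursor:start])
        out.append("<em>" + text[start:end] + "</em>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "デジタル一眼レフカメラ"

# Offsets from the "デジタル一眼 => デジタル, 一眼" output above.
option1 = [
    {"token": "デジタル", "start_offset": 0, "end_offset": 6},
    {"token": "一眼", "start_offset": 0, "end_offset": 6},
    {"token": "レフ", "start_offset": 6, "end_offset": 8},
    {"token": "カメラ", "start_offset": 8, "end_offset": 11},
]

# Offsets when 一眼 covers only its own two characters.
option2 = [
    {"token": "デジタル", "start_offset": 0, "end_offset": 4},
    {"token": "一眼", "start_offset": 4, "end_offset": 6},
    {"token": "レフ", "start_offset": 6, "end_offset": 8},
    {"token": "カメラ", "start_offset": 8, "end_offset": 11},
]

print(highlight(text, option1, "一眼"))  # <em>デジタル一眼</em>レフカメラ
print(highlight(text, option2, "一眼"))  # デジタル<em>一眼</em>レフカメラ
```

The token text is the same in both cases; only the offsets decide which characters of the original string get wrapped.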

Option 2: "一 眼 => 一眼"

Note: this approach works in both Elasticsearch 5.x and 6.x.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_analyzer": {
          "type": "custom",
          "char_filter": [
          ],
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "lowercase",
            "asciifolding",
            "graph_synonym"
          ]
        }
      },
      "filter" : {
        "graph_synonym" : {
          "type" : "synonym_graph",
          "synonyms" : [
            "一 眼 => 一眼"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "ja_analyzer"
        }
      }
    }
  }
}

With this configuration,

GET my_index/_analyze
{
  "analyzer": "ja_analyzer",
  "text":     "デジタル一眼レフカメラ"
}

{
  "tokens": [
    {
      "token": "デジタル",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "一眼",
      "start_offset": 4,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "レフ",
      "start_offset": 6,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "カメラ",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 3
    }
  ]
}

The position and offsets of 一眼 are now correct, so the same search

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "my_text": "一眼"
    }
  },
  "highlight" : {
    "fields" : {
      "my_text" : {}
    }
  }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "my_text": "デジタル一眼レフカメラが欲しい"
        },
        "highlight": {
          "my_text": [
            "デジタル<em>一眼</em>レフカメラが欲しい"
          ]
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_text": "一眼レフカメラが欲しい"
        },
        "highlight": {
          "my_text": [
            "<em>一眼</em>レフカメラが欲しい"
          ]
        }
      }
    ]
  }
}

The highlight position is correct, and both documents match the search.
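
Where the offsets 4 to 6 come from can be sketched as follows. This is a toy model, not the actual `synonym_graph` implementation: `merge_tokens` is a hypothetical function, and the input token list is the デジタル / 一 / 眼 / レフ / カメラ segmentation from the kuromoji table, with offsets I filled in by counting characters. The merged token takes the start offset of the first matched token and the end offset of the last one.

```python
from typing import Dict, List


def merge_tokens(tokens: List[Dict], pattern: List[str], merged: str) -> List[Dict]:
    """Replace each run of tokens matching `pattern` with one `merged` token."""
    out: List[Dict] = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + len(pattern)]
        if [t["token"] for t in window] == pattern:
            out.append({
                "token": merged,
                "start_offset": window[0]["start_offset"],   # start of first token
                "end_offset": window[-1]["end_offset"],      # end of last token
                "type": "SYNONYM",
            })
            i += len(pattern)
        else:
            out.append(dict(tokens[i], type="word"))
            i += 1
    # Renumber positions after merging.
    for position, token in enumerate(out):
        token["position"] = position
    return out


# Assumed tokenizer output for "デジタル一眼レフカメラ" (offsets counted by hand).
tokenized = [
    {"token": "デジタル", "start_offset": 0, "end_offset": 4},
    {"token": "一", "start_offset": 4, "end_offset": 5},
    {"token": "眼", "start_offset": 5, "end_offset": 6},
    {"token": "レフ", "start_offset": 6, "end_offset": 8},
    {"token": "カメラ", "start_offset": 8, "end_offset": 11},
]

merged = merge_tokens(tokenized, ["一", "眼"], "一眼")
for t in merged:
    print(t["position"], t["token"], t["start_offset"], t["end_offset"], t["type"])
```

Because the rule only spans 一 and 眼, the resulting 一眼 token points at exactly those two characters, unlike Option 1 where it inherited the span of デジタル一眼.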

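The drawback: suppose there were a person named 金田一眼, and you wanted to keep it somewhat ambiguous whether the name splits as 金田一 / 眼 or 金田 / 一眼. With this rule it can only ever come out as 金田 / 一眼.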

{
  "tokens": [
    {
      "token": "金田",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "一眼",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}

Note: with 金田一 alone, the output allows for either segmentation.

{
  "tokens": [
    {
      "token": "金田",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "金田一",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "一",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    }
  ]
}
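
How "either segmentation" is encoded in that output can be made explicit by walking the token graph. The following Python sketch assumes the usual reading of `position` and `positionLength`: a token starts at `position` and ends at `position + positionLength`, with `positionLength` defaulting to 1 when absent. `graph_paths` is a hypothetical helper, not an Elasticsearch API.

```python
from typing import Dict, List, Tuple


def graph_paths(tokens: List[Dict]) -> List[List[str]]:
    """Enumerate every token path from position 0 to the end of the graph."""
    edges: Dict[int, List[Tuple[str, int]]] = {}
    end = 0
    for t in tokens:
        start = t["position"]
        stop = start + t.get("positionLength", 1)
        edges.setdefault(start, []).append((t["token"], stop))
        end = max(end, stop)

    def walk(node: int) -> List[List[str]]:
        if node == end:
            return [[]]
        paths: List[List[str]] = []
        for token, nxt in edges.get(node, []):
            for rest in walk(nxt):
                paths.append([token] + rest)
        return paths

    return walk(0)


# Positions copied from the 金田一 output above.
kindaichi = [
    {"token": "金田", "position": 0},
    {"token": "金田一", "position": 0, "positionLength": 2},
    {"token": "一", "position": 1},
]

for path in graph_paths(kindaichi):
    print(" / ".join(path))
```

The graph holds two paths, 金田 / 一 and 金田一, so a query for either form can match. For 金田一眼 the earlier output has a single path, 金田 / 一眼.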