
Becoming a Search Pro with ElasticSearch! 07: About Analyze


When running searches, there are times when I wonder: are these results really right?

Strictly speaking, it's less a doubt about the results themselves than a vague sense that the results don't match my own search logic.

An RDBMS has execution plans, which let you confirm that a query is behaving the way you intended.

ElasticSearch has a similar feature: hitting _analyze shows you that kind of information.

curl -XGET 'http://localhost:9200/search/_analyze?pretty' -d '{"size":100,"from":0,"query":{"query_string":{"query":"text-ja-ma:(\"魔法少女まどか☆マギカ\")^100 OR text-ja-2gram:(\"魔法少女まどか☆マギカ\")^10"}}}'

Result:

  }, {
    "token" : "魔",
    "start_offset" : 69,
    "end_offset" : 70,
    "type" : "<IDEOGRAPHIC>",
    "position" : 11
  }, {
    "token" : "法",
    "start_offset" : 70,
    "end_offset" : 71,
    "type" : "<IDEOGRAPHIC>",
    "position" : 12
  }, {
    "token" : "少",
    "start_offset" : 71,
    "end_offset" : 72,
    "type" : "<IDEOGRAPHIC>",
    "position" : 13
  }, {
    "token" : "女",
    "start_offset" : 72,
    "end_offset" : 73,
    "type" : "<IDEOGRAPHIC>",
    "position" : 14
  }, {
    "token" : "ま",
    "start_offset" : 73,
    "end_offset" : 74,
    "type" : "<HIRAGANA>",
    "position" : 15
  }, {
    "token" : "ど",
    "start_offset" : 74,
    "end_offset" : 75,
    "type" : "<HIRAGANA>",
    "position" : 16
  }, {
    "token" : "か",
    "start_offset" : 75,
    "end_offset" : 76,
    "type" : "<HIRAGANA>",
    "position" : 17
  }, {
    "token" : "マギカ",
    "start_offset" : 77,
    "end_offset" : 80,
    "type" : "<KATAKANA>",
    "position" : 18
  }, {

Somehow it looks like the Analyzer isn't being applied. (Note that _analyze here treats the whole request body as the text to analyze, so the JSON of the query itself gets tokenized too; that's why the offsets start around 69.)
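With no analyzer specified, _analyze falls back to the index default, which, judging by the &lt;IDEOGRAPHIC&gt;/&lt;HIRAGANA&gt;/&lt;KATAKANA&gt; token types, is the standard analyzer: one token per ideographic or hiragana character, with katakana runs kept whole. A rough Python sketch of that per-character behaviour (an illustration only, not Lucene's actual implementation):

```python
import unicodedata

def char_class(ch):
    """Rough Unicode-script classification, mirroring the token
    types the standard tokenizer reports for CJK characters."""
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "IDEOGRAPHIC"
    if "HIRAGANA" in name:
        return "HIRAGANA"
    if "KATAKANA" in name:
        return "KATAKANA"
    return "OTHER"

def standard_like_tokenize(text):
    """Emit one token per ideographic/hiragana character, but keep
    katakana runs together -- the shape of the output above."""
    tokens = []
    i = 0
    while i < len(text):
        cls = char_class(text[i])
        if cls == "KATAKANA":
            j = i
            while j < len(text) and char_class(text[j]) == "KATAKANA":
                j += 1
            tokens.append((text[i:j], cls))
            i = j
        elif cls in ("IDEOGRAPHIC", "HIRAGANA"):
            tokens.append((text[i], cls))
            i += 1
        else:
            i += 1  # punctuation like ☆ produces no token
    return tokens

print(standard_like_tokenize("魔法少女まどか☆マギカ"))
```

This reproduces the single-character 魔/法/少/女/ま/ど/か tokens and the single マギカ token seen in the result.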

Suspecting the Analyzer configuration was wrong, and since _analyze lets you specify an analyzer explicitly, I tried the ones configured this time.
First, morphological analysis:

curl -XGET 'http://localhost:9200/search/_analyze?analyzer=ja-ma-analyzer&pretty' -d '{"size":100,"from":0,"query":{"query_string":{"query":"text-ja-ma:(\"魔法少女まどか☆マギカ\")^100 OR text-ja-2gram:(\"魔法少女まどか☆マギカ\")^10"}}}'

Run it:

  }, {
    "token" : "魔法",
    "start_offset" : 108,
    "end_offset" : 110,
    "type" : "word",
    "position" : 22
  }, {
    "token" : "少女",
    "start_offset" : 110,
    "end_offset" : 112,
    "type" : "word",
    "position" : 23
  }, {
    "token" : "まどか",
    "start_offset" : 112,
    "end_offset" : 115,
    "type" : "word",
    "position" : 24
  }, {
    "token" : "マギカ",
    "start_offset" : 116,
    "end_offset" : 119,
    "type" : "word",
    "position" : 25
  }, {

This result makes sense to me.
Next, n-gram (bigram):

curl -XGET 'http://localhost:9200/search/_analyze?analyzer=ja-2gram-analyzer&pretty' -d '{"size":100,"from":0,"query":{"query_string":{"query":"text-ja-ma:(\"魔法少女まどか☆マギカ\")^100 OR text-ja-2gram:(\"魔法少女まどか☆マギカ\")^10"}}}'

Result:

  }, {
    "token" : "魔法",
    "start_offset" : 108,
    "end_offset" : 110,
    "type" : "word",
    "position" : 109
  }, {
    "token" : "法少",
    "start_offset" : 109,
    "end_offset" : 111,
    "type" : "word",
    "position" : 110
  }, {
    "token" : "少女",
    "start_offset" : 110,
    "end_offset" : 112,
    "type" : "word",
    "position" : 111
  }, {
    "token" : "女ま",
    "start_offset" : 111,
    "end_offset" : 113,
    "type" : "word",
    "position" : 112
  }, {
    "token" : "まど",
    "start_offset" : 112,
    "end_offset" : 114,
    "type" : "word",
    "position" : 113
  }, {
    "token" : "どか",
    "start_offset" : 113,
    "end_offset" : 115,
    "type" : "word",
    "position" : 114
  }, {
    "token" : "か☆",
    "start_offset" : 114,
    "end_offset" : 116,
    "type" : "word",
    "position" : 115
  }, {
    "token" : "☆マ",
    "start_offset" : 115,
    "end_offset" : 117,
    "type" : "word",
    "position" : 116
  }, {
    "token" : "マギ",
    "start_offset" : 116,
    "end_offset" : 118,
    "type" : "word",
    "position" : 117
  }, {
    "token" : "ギカ",
    "start_offset" : 117,
    "end_offset" : 119,
    "type" : "word",
    "position" : 118
  }, {
    "token" : "カ\\",
    "start_offset" : 118,
    "end_offset" : 120,
    "type" : "word",
    "position" : 119
  }, {
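The bigram stream above can be reproduced with a plain sliding window; a minimal sketch matching the min_gram=2 / max_gram=2 setting of ja-2gram-tokenizer:

```python
def ngram_tokenize(text, min_gram=2, max_gram=2):
    """Sliding-window n-gram tokenizer sketch; with
    min_gram == max_gram == 2 it emits every adjacent pair."""
    tokens = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            tokens.append(text[i:i + n])
    return tokens

print(ngram_tokenize("魔法少女まどか☆マギカ"))
```

For the 11-character title this yields the same ten bigrams as the result above, ☆ included, since an n-gram tokenizer makes no distinction between letters and symbols.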

The output is exactly in n-gram form, so it seems the Analyzer settings themselves are correct.
Which leaves either:
① Dynamic mapping isn't taking effect
② The query is written badly

so let's start by reviewing the template.

① Checking the dynamic mapping
The settings can be verified with the following requests.
First, the analyzers:

curl -XGET http://localhost:9200/search/_settings?pretty

Result:

{
  "search" : {
    "settings" : {
      "index" : {
        "creation_date" : "1444197536551",
        "analysis" : {
          "filter" : {
            "greek_lowercase_filter" : {
              "type" : "lowercase",
              "language" : "greek"
            }
          },
          "analyzer" : {
            "ja-ma-analyzer" : {
              "filter" : [ "kuromoji_baseform", "greek_lowercase_filter", "cjk_width" ],
              "type" : "custom",
              "tokenizer" : "ja-ma-tokenizer"
            },
            "ja-2gram-analyzer" : {
              "type" : "custom",
              "filter" : [ "greek_lowercase_filter", "cjk_width" ],
              "tokenizer" : "ja-2gram-tokenizer"
            }
          },
          "tokenizer" : {
            "ja-2gram-tokenizer" : {
              "type" : "nGram",
              "min_gram" : "2",
              "max_gram" : "2"
            },
            "ja-ma-tokenizer" : {
              "type" : "kuromoji_tokenizer",
              "mode" : "normal"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "version" : {
          "created" : "1070199"
        },
        "uuid" : "p5NP1VxfRKO2oPhQM1mavg"
      }
    }
  }
}
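As the settings show, each analyzer is just a tokenizer plus an ordered filter chain. A loose Python sketch of the ja-2gram-analyzer pipeline (str.lower() and NFKC normalization are rough stand-ins for the real lowercase and cjk_width filters, for illustration only):

```python
import unicodedata

def ja_2gram_analyze(text):
    """Token stream = tokenizer output passed through the filter
    chain in order, as in the ja-2gram-analyzer definition above."""
    # ja-2gram-tokenizer: fixed-size sliding window (min_gram = max_gram = 2)
    tokens = [text[i:i + 2] for i in range(len(text) - 1)]
    # greek_lowercase_filter: lowercasing (str.lower() as a stand-in)
    tokens = [t.lower() for t in tokens]
    # cjk_width: width folding; NFKC normalization is a rough stand-in
    tokens = [unicodedata.normalize("NFKC", t) for t in tokens]
    return tokens

print(ja_2gram_analyze("ＭａＧｉ"))  # → ['ma', 'ag', 'gi']
```

The point is the ordering: filters run over the tokenizer's output, in the sequence listed in the "filter" array.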

You can see they are configured correctly.
So next, check the mappings:

curl -XGET http://localhost:9200/search/_mappings?pretty

Result:

{
  "search" : {
    "mappings" : {
      "_default_" : {
        "dynamic_templates" : [ {
          "search_text_for_ma" : {
            "mapping" : {
              "analyzer" : "ja-ma-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-ma"
          }
        }, {
          "search_text_for_2gram" : {
            "mapping" : {
              "analyzer" : "ja-2gram-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-2gram"
          }
        } ],
        "properties" : { }
      },
      "title" : {
        "dynamic_templates" : [ {
          "search_text_for_ma" : {
            "mapping" : {
              "analyzer" : "ja-ma-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-ma"
          }
        }, {
          "search_text_for_2gram" : {
            "mapping" : {
              "analyzer" : "ja-2gram-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-2gram"
          }
        } ],
        "properties" : {
          "category" : {
            "type" : "string"
          },
          "sequenceKey" : {
            "type" : "string"
          },
          "text-ja-2gram" : {
            "type" : "string",
            "analyzer" : "ja-2gram-analyzer"
          },
          "text-ja-ma" : {
            "type" : "string",
            "analyzer" : "ja-ma-analyzer"
          },
          "text1" : {
            "type" : "string"
          },
          "text2" : {
            "type" : "string"
          },
          "text3" : {
            "type" : "string"
          },
          "text4" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

Properly configured here as well.
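For reference, dynamic templates assign a mapping by matching incoming field names against each template's match pattern (exact names here; simple patterns may also use * wildcards). A hypothetical sketch of that selection logic:

```python
from fnmatch import fnmatch

# Hypothetical re-creation of the two dynamic templates above;
# field names are checked in template order, first hit wins.
TEMPLATES = [
    ("search_text_for_ma", "text-ja-ma", "ja-ma-analyzer"),
    ("search_text_for_2gram", "text-ja-2gram", "ja-2gram-analyzer"),
]

def analyzer_for(field_name):
    """Return the analyzer the dynamic templates would assign,
    or None if no template matches (default mapping applies)."""
    for _name, pattern, analyzer in TEMPLATES:
        if fnmatch(field_name, pattern):
            return analyzer
    return None

print(analyzer_for("text-ja-ma"))  # → ja-ma-analyzer
print(analyzer_for("text1"))       # → None
```

Since "text-ja-ma" and "text-ja-2gram" in the mapping already carry the right analyzers, the templates have clearly fired.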
So it may be that the way the query is written is the problem after all.

Next time, then, I'll experiment with various queries.
