Becoming a search pro with Elasticsearch! 07. About Analyze #Elasticsearch

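When running searches, I sometimes catch myself wondering: are these search results really correct?

Strictly speaking, it is less a doubt about the results themselves and more a vague sense that the results do not line up with the search logic I had in mind.

An RDBMS has execution plans, which let you check whether a query behaves the way you intended.

Elasticsearch has a feature in a similar spirit: the _analyze API returns the tokens that an analyzer produces from the text you send it.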

curl -XGET 'http://localhost:9200/search/_analyze?pretty' -d '{"size":100,"from":0,"query":{"query_string":{"query":"text-ja-ma:(\"魔法少女まどか☆マギカ\")^100 OR text-ja-2gram:(\"魔法少女まどか☆マギカ\")^10"}}}'

Result

  }, {
    "token" : "魔",
    "start_offset" : 69,
    "end_offset" : 70,
    "type" : "<IDEOGRAPHIC>",
    "position" : 11
  }, {
    "token" : "法",
    "start_offset" : 70,
    "end_offset" : 71,
    "type" : "<IDEOGRAPHIC>",
    "position" : 12
  }, {
    "token" : "少",
    "start_offset" : 71,
    "end_offset" : 72,
    "type" : "<IDEOGRAPHIC>",
    "position" : 13
  }, {
    "token" : "女",
    "start_offset" : 72,
    "end_offset" : 73,
    "type" : "<IDEOGRAPHIC>",
    "position" : 14
  }, {
    "token" : "ま",
    "start_offset" : 73,
    "end_offset" : 74,
    "type" : "<HIRAGANA>",
    "position" : 15
  }, {
    "token" : "ど",
    "start_offset" : 74,
    "end_offset" : 75,
    "type" : "<HIRAGANA>",
    "position" : 16
  }, {
    "token" : "か",
    "start_offset" : 75,
    "end_offset" : 76,
    "type" : "<HIRAGANA>",
    "position" : 17
  }, {
    "token" : "マギカ",
    "start_offset" : 77,
    "end_offset" : 80,
    "type" : "<KATAKANA>",
    "position" : 18
  }, {

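Something looks off: every kanji and hiragana character comes back as its own token, so the analyzers I configured do not seem to be applied.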
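One way to read the offsets in this output: a start_offset of 69 for 魔 is far larger than anything inside the 11-character title, which suggests that _analyze tokenized the entire request body, JSON included, as plain text. The Python sketch below locates the title inside the body string from the curl command above. The names body, title and find_all are mine, and it assumes offsets count characters from the start of the body.

```python
# The request body exactly as passed to curl -d. Inside single quotes the
# shell keeps each backslash, so \" is two characters in the body.
body = (
    '{"size":100,"from":0,"query":{"query_string":{"query":'
    '"text-ja-ma:(\\"魔法少女まどか☆マギカ\\")^100 OR '
    'text-ja-2gram:(\\"魔法少女まどか☆マギカ\\")^10"}}}'
)
title = "魔法少女まどか☆マギカ"


def find_all(haystack, needle):
    """Return the character offset of every occurrence of needle."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


title_offsets = find_all(body, title)
print(title_offsets)  # [69, 108]
```

The first value matches the start_offset of 魔 in the output above; the second is where the title appears again, in the text-ja-2gram clause. If this reading is right, the per-character tokens would come from the analyzer the index uses by default rather than from the custom ones, but that is an inference from the offsets, not something verified here.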

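Suspecting that the analyzer configuration was wrong, I found that _analyze lets you specify an analyzer explicitly, so I tried it with the ones I had configured.
First, morphological analysis: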

curl -XGET 'http://localhost:9200/search/_analyze?analyzer=ja-ma-analyzer&pretty' -d '{"size":100,"from":0,"query":{"query_string":{"query":"text-ja-ma:(\"魔法少女まどか☆マギカ\")^100 OR text-ja-2gram:(\"魔法少女まどか☆マギカ\")^10"}}}'

Running this gives:

  }, {
    "token" : "魔法",
    "start_offset" : 108,
    "end_offset" : 110,
    "type" : "word",
    "position" : 22
  }, {
    "token" : "少女",
    "start_offset" : 110,
    "end_offset" : 112,
    "type" : "word",
    "position" : 23
  }, {
    "token" : "まどか",
    "start_offset" : 112,
    "end_offset" : 115,
    "type" : "word",
    "position" : 24
  }, {
    "token" : "マギカ",
    "start_offset" : 116,
    "end_offset" : 119,
    "type" : "word",
    "position" : 25
  }, {

This result looks about right.
Next, n-gram (bigram):

curl -XGET 'http://localhost:9200/search/_analyze?analyzer=ja-2gram-analyzer&pretty' -d '{"size":100,"from":0,"query":{"query_string":{"query":"text-ja-ma:(\"魔法少女まどか☆マギカ\")^100 OR text-ja-2gram:(\"魔法少女まどか☆マギカ\")^10"}}}'

Result

}, {
    "token" : "魔法",
    "start_offset" : 108,
    "end_offset" : 110,
    "type" : "word",
    "position" : 109
  }, {
    "token" : "法少",
    "start_offset" : 109,
    "end_offset" : 111,
    "type" : "word",
    "position" : 110
  }, {
    "token" : "少女",
    "start_offset" : 110,
    "end_offset" : 112,
    "type" : "word",
    "position" : 111
  }, {
    "token" : "女ま",
    "start_offset" : 111,
    "end_offset" : 113,
    "type" : "word",
    "position" : 112
  }, {
    "token" : "まど",
    "start_offset" : 112,
    "end_offset" : 114,
    "type" : "word",
    "position" : 113
  }, {
    "token" : "どか",
    "start_offset" : 113,
    "end_offset" : 115,
    "type" : "word",
    "position" : 114
  }, {
    "token" : "か☆",
    "start_offset" : 114,
    "end_offset" : 116,
    "type" : "word",
    "position" : 115
  }, {
    "token" : "☆マ",
    "start_offset" : 115,
    "end_offset" : 117,
    "type" : "word",
    "position" : 116
  }, {
    "token" : "マギ",
    "start_offset" : 116,
    "end_offset" : 118,
    "type" : "word",
    "position" : 117
  }, {
    "token" : "ギカ",
    "start_offset" : 117,
    "end_offset" : 119,
    "type" : "word",
    "position" : 118
  }, {
    "token" : "カ\\",
    "start_offset" : 118,
    "end_offset" : 120,
    "type" : "word",
    "position" : 119
  }, {

The output is cleanly split into bigrams, so the analyzer configuration itself appears to be correct.
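To make "bigram" concrete, here is a minimal Python sketch of a sliding two-character window. It is not Elasticsearch's nGram tokenizer, only an illustration of the same idea; char_ngrams and base_offset are hypothetical names, and the base offset of 108 is taken from the output above.

```python
def char_ngrams(text, n=2, base_offset=0):
    """Slide a window of n characters over text, one character at a time."""
    tokens = []
    for i in range(len(text) - n + 1):
        tokens.append({
            "token": text[i:i + n],
            "start_offset": base_offset + i,
            "end_offset": base_offset + i + n,
        })
    return tokens


# base_offset=108 is where the title starts in the output above.
bigrams = char_ngrams("魔法少女まどか☆マギカ", n=2, base_offset=108)
for t in bigrams:
    print(t["token"], t["start_offset"], t["end_offset"])
```

The sketch yields 魔法, 法少, 少女 and so on through ギカ, with the same offsets as the _analyze output. The extra token カ\\ in the real output comes from the backslash that follows the title in the request body.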
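That being so, the cause is probably one of the following:
① The dynamic mapping is not taking effect
② The way the query is written is wrong

I will start by reviewing the template.

① Checking the dynamic mapping
The requests below show the current configuration, so let's check it.
First, the analyzers: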

curl -XGET 'http://localhost:9200/search/_settings?pretty'

Result

{
  "search" : {
    "settings" : {
      "index" : {
        "creation_date" : "1444197536551",
        "analysis" : {
          "filter" : {
            "greek_lowercase_filter" : {
              "type" : "lowercase",
              "language" : "greek"
            }
          },
          "analyzer" : {
            "ja-ma-analyzer" : {
              "filter" : [ "kuromoji_baseform", "greek_lowercase_filter", "cjk_width" ],
              "type" : "custom",
              "tokenizer" : "ja-ma-tokenizer"
            },
            "ja-2gram-analyzer" : {
              "type" : "custom",
              "filter" : [ "greek_lowercase_filter", "cjk_width" ],
              "tokenizer" : "ja-2gram-tokenizer"
            }
          },
          "tokenizer" : {
            "ja-2gram-tokenizer" : {
              "type" : "nGram",
              "min_gram" : "2",
              "max_gram" : "2"
            },
            "ja-ma-tokenizer" : {
              "type" : "kuromoji_tokenizer",
              "mode" : "normal"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "version" : {
          "created" : "1070199"
        },
        "uuid" : "p5NP1VxfRKO2oPhQM1mavg"
      }
    }
  }
}

The analyzers are configured as intended.
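Reading this by eye works for two analyzers, but the same check can be scripted. The Python sketch below is one possible way to do it, not part of the original procedure: it takes the analysis section from the response above (other keys omitted) and resolves each analyzer to its tokenizer definition. summarize_analyzers is a hypothetical helper name.

```python
import json

# Analysis section copied from the _settings response above (other keys omitted).
SETTINGS_JSON = """
{
  "search": {"settings": {"index": {"analysis": {
    "filter": {
      "greek_lowercase_filter": {"type": "lowercase", "language": "greek"}
    },
    "analyzer": {
      "ja-ma-analyzer": {
        "type": "custom",
        "tokenizer": "ja-ma-tokenizer",
        "filter": ["kuromoji_baseform", "greek_lowercase_filter", "cjk_width"]
      },
      "ja-2gram-analyzer": {
        "type": "custom",
        "tokenizer": "ja-2gram-tokenizer",
        "filter": ["greek_lowercase_filter", "cjk_width"]
      }
    },
    "tokenizer": {
      "ja-2gram-tokenizer": {"type": "nGram", "min_gram": "2", "max_gram": "2"},
      "ja-ma-tokenizer": {"type": "kuromoji_tokenizer", "mode": "normal"}
    }
  }}}}
}
"""


def summarize_analyzers(analysis):
    """Map each analyzer name to its tokenizer definition and filter list."""
    tokenizers = analysis.get("tokenizer", {})
    summary = {}
    for name, conf in analysis.get("analyzer", {}).items():
        tokenizer_name = conf.get("tokenizer")
        summary[name] = {
            "tokenizer": tokenizer_name,
            # None means the tokenizer is not defined in this index's settings
            # (it could still be a built-in one).
            "tokenizer_def": tokenizers.get(tokenizer_name),
            "filters": conf.get("filter", []),
        }
    return summary


analysis = json.loads(SETTINGS_JSON)["search"]["settings"]["index"]["analysis"]
summary = summarize_analyzers(analysis)
for name, info in sorted(summary.items()):
    print(name, info["tokenizer_def"], info["filters"])
```

Both analyzers resolve to a tokenizer defined in the same index, which is the property being checked here.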
So next, the mapping:

curl -XGET 'http://localhost:9200/search/_mappings?pretty'

Result

{
  "search" : {
    "mappings" : {
      "_default_" : {
        "dynamic_templates" : [ {
          "search_text_for_ma" : {
            "mapping" : {
              "analyzer" : "ja-ma-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-ma"
          }
        }, {
          "search_text_for_2gram" : {
            "mapping" : {
              "analyzer" : "ja-2gram-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-2gram"
          }
        } ],
        "properties" : { }
      },
      "title" : {
        "dynamic_templates" : [ {
          "search_text_for_ma" : {
            "mapping" : {
              "analyzer" : "ja-ma-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-ma"
          }
        }, {
          "search_text_for_2gram" : {
            "mapping" : {
              "analyzer" : "ja-2gram-analyzer",
              "store" : "no",
              "type" : "string"
            },
            "match" : "text-ja-2gram"
          }
        } ],
        "properties" : {
          "category" : {
            "type" : "string"
          },
          "sequenceKey" : {
            "type" : "string"
          },
          "text-ja-2gram" : {
            "type" : "string",
            "analyzer" : "ja-2gram-analyzer"
          },
          "text-ja-ma" : {
            "type" : "string",
            "analyzer" : "ja-ma-analyzer"
          },
          "text1" : {
            "type" : "string"
          },
          "text2" : {
            "type" : "string"
          },
          "text3" : {
            "type" : "string"
          },
          "text4" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

The mapping is also set up correctly.
So the way I am writing the query may be the problem after all.
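For reference, the check I did by eye on the title type can be written as a small Python sketch. It lists which analyzer each field declares; the property data is copied from the response above, while field_analyzers and the "(index default)" label are my own naming. It assumes a field with no analyzer entry falls back to the index default.

```python
# "properties" of the title type, copied from the _mappings response above.
TITLE_PROPERTIES = {
    "category": {"type": "string"},
    "sequenceKey": {"type": "string"},
    "text-ja-2gram": {"type": "string", "analyzer": "ja-2gram-analyzer"},
    "text-ja-ma": {"type": "string", "analyzer": "ja-ma-analyzer"},
    "text1": {"type": "string"},
    "text2": {"type": "string"},
    "text3": {"type": "string"},
    "text4": {"type": "string"},
}


def field_analyzers(properties, fallback="(index default)"):
    """Return {field name: declared analyzer, or fallback when none is set}."""
    return {
        field: conf.get("analyzer", fallback)
        for field, conf in properties.items()
    }


analyzers_by_field = field_analyzers(TITLE_PROPERTIES)
for field, analyzer in sorted(analyzers_by_field.items()):
    print(field, "->", analyzer)
```

Only text-ja-ma and text-ja-2gram carry a custom analyzer, which matches the two dynamic templates; the remaining fields declare none.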

So next time, I will try out various things on the query side.