
Trying kuromoji morphological analysis with the Elasticsearch Analyze API

Elasticsearch

Last updated at 2016-12-18. Posted at 2016-12-18.

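Overview

kuromoji_tokenizer has three modes: normal, search, and extended. When I wanted to call the Analyze API from the Console and look at the results for each one, I found myself wondering how to set the mode. This is a memo of what I did.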

kuromoji_tokenizer

Checking the Analyze API

The 5.1 documentation introduces it as follows.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

Performs the analysis process on a text and return the tokens breakdown of the text.
Can be used without specifying an index against one of the many built in analyzers

curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "this is a test"
}'

If text parameter is provided as array of strings, it is analyzed as a multi-valued field.

curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}'

Or by building a custom transient analyzer out of tokenizers, token filters and char filters. Token filters can use the shorter filter parameter name:

curl -XGET 'localhost:9200/_analyze' -d '
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}'
curl -XGET 'localhost:9200/_analyze' -d '
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}'

Broadly speaking, there seem to be two approaches: specify an analyzer, or specify a tokenizer (along with its options).
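To make the two request shapes concrete, here is a minimal Python sketch that builds either body as a plain dictionary. The helper name `build_analyze_body` and its rule of accepting exactly one of `analyzer` or `tokenizer` are my own assumptions for illustration. They are not part of Elasticsearch or of any client library.

```python
import json


def build_analyze_body(text, analyzer=None, tokenizer=None, filters=(), char_filters=()):
    """Build an Analyze API request body in one of the two shapes quoted above."""
    # Requiring exactly one approach is a rule of this hypothetical helper.
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("pass exactly one of analyzer or tokenizer")
    if analyzer is not None:
        # Shape 1: a named analyzer. Filters are not used in this shape.
        return {"analyzer": analyzer, "text": text}
    # Shape 2: a tokenizer plus optional token filters and char filters.
    body = {"tokenizer": tokenizer}
    if filters:
        body["filter"] = list(filters)
    if char_filters:
        body["char_filter"] = list(char_filters)
    body["text"] = text
    return body


by_analyzer = build_analyze_body("this is a test", analyzer="standard")
by_tokenizer = build_analyze_body(
    "this is a <b>test</b>",
    tokenizer="keyword",
    filters=["lowercase"],
    char_filters=["html_strip"],
)
print(json.dumps(by_analyzer))
print(json.dumps(by_tokenizer))
```

The printed JSON can be pasted as the body of the curl commands shown above.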

Specifying a tokenizer

With this approach, at least going by the Console's autocomplete, specifying the mode looks difficult.

Input

GET _analyze
{
  "tokenizer" : "kuromoji_tokenizer",
  "text" : "関西国際空港"
}

Result

Apparently, what came back is the search mode result, which is consistent with search being the tokenizer's default mode.

{
  "tokens": [
    {
      "token": "関西",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "関西国際空港",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "国際",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "空港",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 2
    }
  ]
}
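One way to see why this looks like search mode is to group the tokens by position: the compound 関西国際空港 and its first part 関西 share position 0. The sketch below uses the tokens copied from the response above (with the `type` field omitted). The helper name `group_by_position` is hypothetical.

```python
from collections import defaultdict

# Tokens copied from the response above ("type" omitted for brevity).
tokens = [
    {"token": "関西", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "関西国際空港", "start_offset": 0, "end_offset": 6, "position": 0},
    {"token": "国際", "start_offset": 2, "end_offset": 4, "position": 1},
    {"token": "空港", "start_offset": 4, "end_offset": 6, "position": 2},
]


def group_by_position(tokens):
    """Map each position to the list of tokens emitted at that position."""
    grouped = defaultdict(list)
    for t in tokens:
        grouped[t["position"]].append(t["token"])
    return dict(grouped)


grouped = group_by_position(tokens)
for position in sorted(grouped):
    print(position, grouped[position])
```

A position that holds more than one token is where the tokenizer kept the compound alongside its decomposed part.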

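Specifying an analyzer

I will create analyzers that use kuromoji_tokenizer and then specify them in the Analyze API.
Since I want the results for each mode, I decided to create three analyzers: one for normal mode, one for search mode, and one for extended mode.

The following article explains the relationship between analyzers and tokenizers very clearly, so I am linking it here.
Elasticsearchのanalyzerの設定の基礎

Creating the analyzers

I create a throwaway index called qiita and define the analyzers in its settings.

Since all I want is to use the Analyze API to see how the kuromoji_tokenizer output differs by mode, I have not added any extra filters.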

PUT qiita
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja-normal-analyzer": {
          "type": "custom",
          "tokenizer": "ja-normal-tokenizer"
        },
        "ja-search-analyzer": {
          "type": "custom",
          "tokenizer": "ja-search-tokenizer"
        },
        "ja-extended-analyzer": {
          "type": "custom",
          "tokenizer": "ja-extended-tokenizer"
        }
      },
      "tokenizer": {
        "ja-normal-tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "normal"
        },
        "ja-search-tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search"
        },
        "ja-extended-tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "extended"
        }
      }
    }
  }
}
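The settings body above repeats the same pattern three times, so it can also be generated. The following is a minimal sketch that builds the same structure from a list of modes. The function name `build_settings` is hypothetical, and the naming scheme `ja-<mode>-analyzer` / `ja-<mode>-tokenizer` simply mirrors the names used above.

```python
import json

MODES = ["normal", "search", "extended"]


def build_settings(modes):
    """Build one custom analyzer and one kuromoji_tokenizer per mode."""
    analyzers = {}
    tokenizers = {}
    for mode in modes:
        tokenizer_name = f"ja-{mode}-tokenizer"
        # Each analyzer contains only the tokenizer, with no extra filters.
        analyzers[f"ja-{mode}-analyzer"] = {
            "type": "custom",
            "tokenizer": tokenizer_name,
        }
        tokenizers[tokenizer_name] = {"type": "kuromoji_tokenizer", "mode": mode}
    return {"settings": {"analysis": {"analyzer": analyzers, "tokenizer": tokenizers}}}


body = build_settings(MODES)
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON should be equivalent to the body of the PUT qiita request above.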

Example: normal

I call Analyze under the qiita index created above.

POST qiita/_analyze
{
  "analyzer": "ja-normal-analyzer",
  "text" : "関西国際空港"
}

Here is the result. As the manual says, it came back as a single token.

{
  "tokens": [
    {
      "token": "関西国際空港",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    }
  ]
}

Example: search

POST qiita/_analyze
{
  "analyzer": "ja-search-analyzer",
  "text" : "関西国際空港"
}

{
  "tokens": [
    {
      "token": "関西",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "関西国際空港",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "国際",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "空港",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 2
    }
  ]
}

Example: extended

POST qiita/_analyze
{
  "analyzer": "ja-extended-analyzer",
  "text" : "関西国際空港"
}

{
  "tokens": [
    {
      "token": "関西",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "国際",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "空港",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 2
    }
  ]
}
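To compare the three modes side by side, the sketch below lists the tokens from the three responses above and checks which modes produce a given term as a token. This is a deliberately simplified notion of "match": it assumes exact token equality and ignores how the query itself would be analyzed. The function name `modes_producing` is hypothetical.

```python
# Token lists copied from the three responses above.
results = {
    "normal": ["関西国際空港"],
    "search": ["関西", "関西国際空港", "国際", "空港"],
    "extended": ["関西", "国際", "空港"],
}


def modes_producing(term, results):
    """Return the modes whose token list contains the term exactly."""
    return [mode for mode, tokens in results.items() if term in tokens]


for term in ["関西国際空港", "空港"]:
    print(term, modes_producing(term, results))
```

In this data, only search mode produces both the whole compound and its parts.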

Bonus

Adding "explain": true returns not only the tokens but also information such as the reading and part of speech.

POST qiita/_analyze
{
  "analyzer": "ja-extended-analyzer",
  "text" : "関西国際空港",
  "explain": true 
}

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "ja-extended-tokenizer",
      "tokens": [
        {
          "token": "関西",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "baseForm": null,
          "bytes": "[e9 96 a2 e8 a5 bf]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-固有名詞-地域-一般",
          "partOfSpeech (en)": "noun-proper-place-misc",
          "positionLength": 1,
          "pronunciation": "カンサイ",
          "pronunciation (en)": "kansai",
          "reading": "カンサイ",
          "reading (en)": "kansai"
        },
        {
          "token": "国際",
          "start_offset": 2,
          "end_offset": 4,
          "type": "word",
          "position": 1,
          "baseForm": null,
          "bytes": "[e5 9b bd e9 9a 9b]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-一般",
          "partOfSpeech (en)": "noun-common",
          "positionLength": 1,
          "pronunciation": "コクサイ",
          "pronunciation (en)": "kokusai",
          "reading": "コクサイ",
          "reading (en)": "kokusai"
        },
        {
          "token": "空港",
          "start_offset": 4,
          "end_offset": 6,
          "type": "word",
          "position": 2,
          "baseForm": null,
          "bytes": "[e7 a9 ba e6 b8 af]",
          "inflectionForm": null,
          "inflectionForm (en)": null,
          "inflectionType": null,
          "inflectionType (en)": null,
          "partOfSpeech": "名詞-一般",
          "partOfSpeech (en)": "noun-common",
          "positionLength": 1,
          "pronunciation": "クーコー",
          "pronunciation (en)": "kuko",
          "reading": "クウコウ",
          "reading (en)": "kuukō"
        }
      ]
    },
    "tokenfilters": []
  }
}
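If only the reading and part of speech are needed, they can be pulled from `detail.tokenizer.tokens` in the explain response. The sketch below uses a trimmed copy of the response above, keeping only the fields it reads. The function name `token_details` is hypothetical.

```python
# Trimmed copy of the explain response above; only the fields used here are kept.
response = {
    "detail": {
        "custom_analyzer": True,
        "tokenizer": {
            "name": "ja-extended-tokenizer",
            "tokens": [
                {"token": "関西", "reading": "カンサイ", "partOfSpeech": "名詞-固有名詞-地域-一般"},
                {"token": "国際", "reading": "コクサイ", "partOfSpeech": "名詞-一般"},
                {"token": "空港", "reading": "クウコウ", "partOfSpeech": "名詞-一般"},
            ],
        },
    }
}


def token_details(response):
    """Return (token, reading, partOfSpeech) tuples from an explain response."""
    tokens = response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t["reading"], t["partOfSpeech"]) for t in tokens]


details = token_details(response)
for token, reading, pos in details:
    print(token, reading, pos)
```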

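Some thoughts in place of a summary

The person who taught me FAST ESP often brings up the phrase 「東京都の山寺」, which can be read either as ひがしきょうとのやまでら (a mountain temple in east Kyoto) or as とうきょうとのやまでら (a mountain temple in Tokyo Metropolis).
kuromoji splits it into 「東京」 and 「都」, so a search for 「京都」 does not match.
Because of cases like this, my favorite setup combines morphological analysis with n-grams.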
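The difference can be illustrated with plain character bigrams, without Elasticsearch. In the sketch below, the morphological token list is an assumption: the text above only states that kuromoji yields 「東京」 and 「都」, and the remaining tokens (の, 山寺) are my guess for illustration. The function name `char_ngrams` is hypothetical and is not an Elasticsearch API.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


text = "東京都の山寺"

# Assumed morphological split; only 東京 and 都 are confirmed by the article.
morph_tokens = ["東京", "都", "の", "山寺"]
bigrams = char_ngrams(text, 2)

print("bigrams:", bigrams)
print("京都 in morphological tokens:", "京都" in morph_tokens)
print("京都 in bigrams:", "京都" in bigrams)
```

The bigram list contains 京都 while the morphological tokens do not, which is why indexing both can catch the second reading.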
