
Adding Elasticsearch synonyms fails with illegal_argument_exception for certain Japanese terms

Posted at 2018-05-18

Introduction

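I wanted to add Japanese synonyms, but with the settings below,
adding a synonym rule that contains 「株式会社」 fails with an error.
None of 「株式会」, 「式会社」, 「株式」, or 「会社」 causes an error, yet 「株式会社」 does.
I am using Python, but the same problem probably occurs in other environments as well.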

# Python dict (not JSON, because Elasticsearch is accessed from Python here).
{
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "kuromoji_tokenizer",
                    "filter": [
                        "my_synonym"
                    ]
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "株式会社,アメリカンフットボール"
                    ]
                }
            }
        }
    }
}

elasticsearch.exceptions.RequestError: TransportError(400, 'illegal_argument_exception', 'failed to build synonyms')

This is the error that is raised when the index is created.

Elastic 6.2.1 アップグレードkuromoji_synのエラー出ます。 - Elastic In Your Native Tongue / 日本語による質問・議論はこちら - Discuss the Elastic Stack

This is an example of someone else for whom Japanese synonyms are not working.
I ran into the same problem.

What I found on the web

Tokenizer? Filter?

compatibility synonym with other filter ? · Issue #27481 · elastic/elasticsearch

According to this issue:

This means that all your synonym rules are analyzed with an icu_tokenizer and the following filters

In other words, the synonym rules themselves are run through the tokenizer, special characters included.

'% => pour cent' is not accepted because % is removed by the icu_tokenizer and therefore could never be found in a text that pass through this analyzer.

So % is stripped out by the icu_tokenizer, and a rule that depends on it cannot be built.
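As a rough illustration of the point above, here is a toy model in Python. It is not the icu_tokenizer: `toy_tokenize` and `rule_survives_analysis` are hypothetical names, and the stand-in tokenizer simply keeps runs of word characters. That is enough to show why a rule whose left-hand side is pure punctuation leaves nothing to match.

```python
import re
from typing import List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a tokenizer that discards punctuation.

    It keeps runs of word characters only; the real icu_tokenizer is far
    more sophisticated.
    """
    return re.findall(r"\w+", text)


def rule_survives_analysis(rule: str) -> bool:
    """Return True if every term of an 'a => b' or 'a, b' rule keeps a token."""
    sides = rule.split("=>")
    terms = [term for side in sides for term in side.split(",")]
    return all(toy_tokenize(term) for term in terms)


print(toy_tokenize("%"))                               # []
print(rule_survives_analysis("% => pour cent"))        # False
print(rule_survives_analysis("percent => pour cent"))  # True
```

The same reasoning applies to any analyzer chain: a synonym rule only makes sense if its terms still exist after the tokenizer and the preceding filters have run.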

The phonetic filter should be put after the synonym filter. The synonyms should not be checked against the phonetic form and we disallow rules that have multiple rewriting (original form + phonetic form for instance).

The order of the filters apparently matters too.

think it would be possible to extend the SynonymMap parsing so that it could handle graph tokenstreams, but it wouldn't be simple.

According to this thread:

Is there any other way to fix this, other than having to delete the synonyms?

I think it would be possible to extend the SynonymMap parsing so that it could handle graph tokenstreams, but it wouldn't be simple. The other immediate workaround would be to see if you really need to have the word delimiter filter in there.

The discussion concludes that a proper fix would not be simple, which is discouraging.
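The following is a toy model of what may be going on, not a reproduction of Lucene's code. The token lists are written by hand and are an assumption about what kuromoji_tokenizer emits for 株式会社 in each mode (in search mode, the compound is assumed to be kept alongside its parts, stacked at the same position); they were not captured from a real cluster. `Token`, `ASSUMED_OUTPUT`, and `check_synonym_term` are hypothetical names. The check models the idea that the synonym parser expects each rule term to analyze to a plain, linear sequence of tokens rather than a graph.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position_increment: int  # 1 = next position, 0 = stacked on the previous token


# Hand-written, ASSUMED tokenizer output for "株式会社" (not captured from a cluster).
ASSUMED_OUTPUT = {
    "normal": [Token("株式会社", 1)],
    "search": [Token("株式", 1), Token("株式会社", 0), Token("会社", 1)],
}


def check_synonym_term(tokens: List[Token]) -> None:
    """Raise ValueError unless the tokens form a plain linear sequence."""
    if not tokens:
        raise ValueError("term was completely removed by the analyzer")
    for token in tokens:
        if token.position_increment != 1:
            raise ValueError(
                f"token {token.term!r} has position increment "
                f"{token.position_increment}, expected 1"
            )


for mode, tokens in ASSUMED_OUTPUT.items():
    try:
        check_synonym_term(tokens)
        print(mode, "-> rule accepted")
    except ValueError as exc:
        print(mode, "-> rule rejected:", exc)
```

If this model is right, it would also explain why 「株式」 or 「会社」 alone cause no error: they are not split any further, so no stacked token appears.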

Other references:

A similar answer: Synonym using a file is not working: malformed_input_exception - Elasticsearch - Discuss the Elastic Stack

Character encoding?

Synonyms not working with diacritic chars - Elasticsearch - Discuss the Elastic Stack

Reportedly the problem was solved by loading the synonyms from a file and saving that file as UTF-8.
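A minimal sketch of that approach in Python, assuming the synonyms are written out by a script. The temporary directory and the relative path `analysis/synonym.txt` are placeholders: on a real node the file has to be placed where the `synonyms_path` setting points, and this has not been tested against a cluster here.

```python
import os
import tempfile

synonyms = ["株式会社,アメリカンフットボール"]

# Hypothetical location for the demo; a real file would live under the
# Elasticsearch config directory.
synonym_dir = tempfile.mkdtemp()
synonym_file = os.path.join(synonym_dir, "synonym.txt")

# Write one rule per line, explicitly as UTF-8 instead of the platform default.
with open(synonym_file, "w", encoding="utf-8", newline="\n") as f:
    for rule in synonyms:
        f.write(rule + "\n")

# Filter definition that reads the rules from a file instead of inline.
filter_settings = {
    "my_synonym": {
        "type": "synonym",
        "synonyms_path": "analysis/synonym.txt",  # assumed path, relative to config
    }
}

# Read the raw bytes back to confirm they decode cleanly as UTF-8.
with open(synonym_file, "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```

Passing `encoding="utf-8"` explicitly matters mainly on platforms whose default encoding is something else, such as cp932 on Japanese Windows.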

Synonym ÅÄÖ exception - Elasticsearch - Discuss the Elastic Stack

Here too, the cause is said to be the character encoding.

Experiments

Changing the tokenizer

I switched the tokenizer to whitespace or standard.
With either one, adding the synonym appears to succeed.
However, the synonym itself does not work as intended.

Customizing the tokenizer

Change the mode of the kuromoji tokenizer:

# Python dict (not JSON).
{
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "ja_tokenizer",
                    "type": "custom",
                    "filter": [
                        "my_synonym"
                    ]
                }
            },
            "tokenizer": {
                "ja_tokenizer": {
                    "type": "kuromoji_tokenizer",
                    "mode": "normal",
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "株式会社,アメリカンフットボール"
                    ]
                }
            }
        }
    }
}

With this definition, the synonym was added without an error.
However, I could not work out what to do when kuromoji_tokenizer has to be used in search mode.
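To compare modes without copying the whole dictionary each time, the settings can be generated by a small helper. This is only a sketch: `build_settings` and `KUROMOJI_MODES` are hypothetical names, and the commented-out call assumes the elasticsearch Python client with a made-up index name.

```python
from typing import Any, Dict, List

KUROMOJI_MODES = ("normal", "search", "extended")


def build_settings(mode: str, synonyms: List[str]) -> Dict[str, Any]:
    """Build index settings that differ only in the kuromoji tokenizer mode."""
    if mode not in KUROMOJI_MODES:
        raise ValueError(f"unknown kuromoji mode: {mode!r}")
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "tokenizer": "ja_tokenizer",
                        "filter": ["my_synonym"],
                    }
                },
                "tokenizer": {
                    "ja_tokenizer": {"type": "kuromoji_tokenizer", "mode": mode}
                },
                "filter": {
                    "my_synonym": {"type": "synonym", "synonyms": list(synonyms)}
                },
            }
        }
    }


body = build_settings("normal", ["株式会社,アメリカンフットボール"])
print(body["settings"]["analysis"]["tokenizer"]["ja_tokenizer"])
# With the elasticsearch Python client (assumed; the index name is hypothetical):
# es.indices.create(index="my_index", body=body)
```

Calling it once with "normal" and once with "search" makes it easy to confirm that the mode is the only thing that changes between a working and a failing index definition.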

Elasticsearch 日本語で全文検索その２ – Hello! Elasticsearch. – Medium

This example appears to use the standard tokenizer.
Perhaps because I rely on phrase search, I get no hits when I use the standard tokenizer.

Elasticsearch 2.3でKuromojiとキャッキャウフフしてみる

This one uses kuromoji_tokenizer, but feeding it the きゃりーぱみゅぱみゅ text did not cause an error, so perhaps it just happens to avoid the problem?

ElasticsearchのAnalyzer入門〜滝沢カレンの謎インスタをヒットさせろ〜 - inFablic | Fablic, inc. Developer's Blog.

This one also uses kuromoji_tokenizer in search mode. Is it working thanks to icu_normalizer?
Or did the error simply happen not to occur?

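I suspect these authors simply did not hit the error by chance.
Another possibility: does the behavior differ depending on whether the analyzer is used as search_analyzer or index_analyzer?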

Setting separate analyzers for search and indexing

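Following references along these lines, I specified both search_analyzer and analyzer in the mapping, but nothing changed.
Note: index_analyzer apparently has to be specified as analyzer since some intermediate version.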
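For reference, a field definition with separate analyzers looks roughly like the sketch below. The field name `content` and the analyzer names `ja_index` and `ja_search` are hypothetical, `text_field` is a made-up helper, and how `properties` is nested under `mappings` depends on the Elasticsearch version.

```python
from typing import Dict


def text_field(index_analyzer: str, search_analyzer: str) -> Dict[str, str]:
    """Field definition with separate index-time and search-time analyzers."""
    return {
        "type": "text",
        "analyzer": index_analyzer,          # used at index time
        "search_analyzer": search_analyzer,  # used at query time
    }


# Hypothetical field and analyzer names, for illustration only.
properties = {"content": text_field("ja_index", "ja_search")}
print(properties)
```

Both analyzers still have to be defined under `settings.analysis`, so a synonym filter that fails to build there would fail regardless of which of the two references it.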

Conclusion

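I ran out of time before I could investigate fully, but it appears that search mode cannot be used when synonyms are combined with kuromoji_tokenizer. In that case, the options seem to be either dropping kuromoji_tokenizer or switching it to normal mode.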

I have not tested loading synonyms from a file, so that approach might still solve the problem.
Also, examining the analyzer output in more detail, for example with explain, might reveal other solutions.
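One way to start that analysis is the `_analyze` API with `explain` enabled, which reports token attributes such as positions. The sketch below only builds the request body; `build_analyze_request` is a hypothetical helper, and it assumes that your Elasticsearch version accepts an inline tokenizer definition in `_analyze` and that the analysis-kuromoji plugin is installed.

```python
import json
from typing import Any, Dict


def build_analyze_request(text: str, mode: str) -> Dict[str, Any]:
    """Body for the _analyze API using an inline kuromoji tokenizer definition."""
    return {
        "tokenizer": {"type": "kuromoji_tokenizer", "mode": mode},
        "text": text,
        "explain": True,  # ask for detailed token attributes
    }


body = build_analyze_request("株式会社", "search")
print(json.dumps(body, ensure_ascii=False, indent=2))
# With the elasticsearch Python client (assumed):
# result = es.indices.analyze(body=body)
```

Comparing the output for "normal" and "search" on the same text should show whether search mode produces extra tokens for 「株式会社」 that the synonym filter cannot handle.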

If you have any further information, I would appreciate hearing about it.
