
Using Japanese WordNet as a synonym dictionary

Elasticsearch

Posted at 2023-10-09

Prerequisites

macOS 13.0 (22A380)
Docker version 24.0.6, build ed223bc
Elasticsearch 7.17.10
elasticsearch-model (7.2.1)

Steps

Note: this assumes Elasticsearch is already running in Docker.

Preparing the dictionary

For anyone wondering what Japanese WordNet is in the first place, I looked into it briefly in a separate post.

Download the latest release, "Japanese WordNet (1.1)", from:
https://bond-lab.github.io/wnja/jpn/downloads.html

The file used here is "Japanese Wordnet and English WordNet in an sqlite3 database".

Next, convert it into a format that Elasticsearch can load.
Here I use the Solr format (essentially, comma-separated terms).

Note that "synonyms" here means "words linked to the same concept", that is, words sharing a synset.

sqlite3 db/wnjpn.db

# Redirect query output to a file
> .output /tmp/jpn_wordnet_synonym.txt

# Target: Japanese nouns only
> select group_concat(w.lemma)
> from word as w inner join sense as s on w.wordid = s.wordid
> where w.pos = 'n' and w.lang = 'jpn'
> group by s.synset
> having count(1) >= 2
> ;

docker cp /tmp/jpn_wordnet_synonym.txt [CONTAINER_ID]:/usr/share/elasticsearch/data/
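
The export above can also be scripted. The following is a minimal sketch, not the method used in this post: it runs the same query through Python's standard `sqlite3` module against a tiny in-memory database. The helper names (`export_synonyms`, `build_toy_db`), the trimmed-down `word` and `sense` schema, and the sample rows are all assumptions made for illustration. The real `wnjpn.db` has more columns and should be opened by path instead.

```python
import sqlite3


def export_synonyms(conn):
    """Return one comma-separated line per synset that has 2+ Japanese nouns."""
    rows = conn.execute(
        """
        SELECT group_concat(w.lemma)
        FROM word AS w
        INNER JOIN sense AS s ON w.wordid = s.wordid
        WHERE w.pos = 'n' AND w.lang = 'jpn'
        GROUP BY s.synset
        HAVING count(1) >= 2
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_toy_db():
    """Build a hypothetical, trimmed-down stand-in for wnjpn.db (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE word (wordid INTEGER PRIMARY KEY, lang TEXT, lemma TEXT, pos TEXT);
        CREATE TABLE sense (synset TEXT, wordid INTEGER);
        INSERT INTO word VALUES
            (1, 'jpn', '一撃', 'n'),
            (2, 'jpn', '強打', 'n'),
            (3, 'jpn', '犬',   'n'),
            (4, 'eng', 'dog',  'n'),
            (5, 'jpn', '痛打', 'n');
        -- synset s2 has only one Japanese noun, so it is filtered out
        INSERT INTO sense VALUES
            ('s1', 1), ('s1', 2),
            ('s2', 3), ('s2', 4),
            ('s3', 1), ('s3', 5);
        """
    )
    return conn


conn = build_toy_db()
lines = export_synonyms(conn)
for line in lines:
    print(line)  # one synonym group per line, terms separated by commas
```

With the real database, `sqlite3.connect("db/wnjpn.db")` would replace `build_toy_db()`, and the lines would be written to a file. Because `group_concat` joins with a bare comma, it may be worth checking the output for lemmas that themselves contain commas before loading it.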

Creating the index

elasticsearch-model settings

  settings index: { number_of_shards: 1 } do
    mappings dynamic: 'false' do
      indexes :value, type: 'text', analyzer: 'synonym_analyzer'
    end
  end

  settings analysis: {
    "filter": {
      "wordnet_synonym": {
        "type": "synonym_graph",
        "lenient": true,
        "synonyms_path": "/usr/share/elasticsearch/data/jpn_wordnet_synonym.txt"
      }
    },
    "analyzer": {
      "synonym_analyzer": {
        "tokenizer": "sudachi_tokenizer", # defined separately using Sudachi
        "type": "custom",
        "char_filter": [],
        "filter": [
          "wordnet_synonym"
        ]
      }
    }
  }

Checking the result

GET [INDEX_NAME]/_analyze
{
  "analyzer": "synonym_analyzer",
  "text" : "一撃"
}

=> (before the synonym setup)

{
  "tokens" : [
    {
      "token" : "一撃",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    }
  ]
}

After the setup above, running the same command returns:

{
  "tokens" : [
    {
      "token" : "当り",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "ヒット",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "一",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "適中",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "当たり",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "強打",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "スラッグ",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "痛打",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "ワンツー",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "一撃",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "発",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

The synonyms are now expanded 🎉
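
To read the response as a token graph, it can help to group tokens by the span of positions they cover. The sketch below is an illustration only: `token_spans` is a hypothetical helper, and the sample list is a hand-copied subset of the response above. It assumes `positionLength` is left out of the response when it equals 1, which matches what the output shows.

```python
from collections import defaultdict


def token_spans(tokens):
    """Group token texts by the (start, end) positions they cover in the graph."""
    spans = defaultdict(list)
    for tok in tokens:
        start = tok["position"]
        # Assumption: a missing positionLength means the token spans one position.
        end = start + tok.get("positionLength", 1)
        spans[(start, end)].append(tok["token"])
    return dict(spans)


# Subset of the _analyze response shown above
tokens = [
    {"token": "当り", "type": "SYNONYM", "position": 0, "positionLength": 2},
    {"token": "一", "type": "SYNONYM", "position": 0},
    {"token": "一撃", "type": "word", "position": 0, "positionLength": 2},
    {"token": "発", "type": "SYNONYM", "position": 1},
]

spans = token_spans(tokens)
for span, texts in sorted(spans.items()):
    print(span, texts)
```

Read this way, "一" (positions 0 to 1) followed by "発" (positions 1 to 2) forms a two-token path that runs alongside the single tokens spanning positions 0 to 2.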

Open issues to look into

"一発" being split into separate tokens is not ideal.
When the same word appears under different concepts, do they all get lumped together?
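
On the second question, here is a simplified model of what I would expect, not a description of Elasticsearch internals. It assumes that each comma-separated line is an equivalence group and that a term appearing on several lines expands to the union of those lines, without chaining further. The function `build_expansions` and the two sample lines are hypothetical, and tokenization of the entries is ignored.

```python
from collections import defaultdict


def build_expansions(lines):
    """Map each term to the union of all terms on every line it appears on."""
    expansions = defaultdict(set)
    for line in lines:
        terms = [t.strip() for t in line.split(",") if t.strip()]
        for term in terms:
            expansions[term].update(terms)
    return expansions


# Two hypothetical synsets that share the word "一撃"
lines = [
    "一撃,強打,痛打",
    "一撃,ヒット,当たり",
]

expansions = build_expansions(lines)
print(sorted(expansions["一撃"]))  # terms from both lines
print(sorted(expansions["強打"]))  # terms from the first line only
```

Under this model, a search for "一撃" would match terms from both concepts, while "強打" would not reach "ヒット". If that mixing is a problem, one option is to filter the exported lines by synset before loading them.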
