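These are notes from using the Japanese WordNet from R (I plan to tidy them up later).
What is the Japanese WordNet?
"The Japanese WordNet is a concept dictionary for Japanese. Individual concepts are grouped into units called 'synsets', and these are semantically linked to other synsets." (Quoted from the site referenced below; see that site for details as well.)
Preparation
Download the SQLite3 database file and decompress it.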
$ wget http://compling.hss.ntu.edu.sg/wnja/data/1.1/wnjpn.db.gz
$ gzip -d wnjpn.db.gz
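If you would rather stay inside R for this step, the decompression can be sketched with base R connections only. The helper name `gunzip_file` and its `chunk_size` argument are made up for this illustration, and the demo below round-trips a small temporary file instead of the real database.

```r
# Hypothetical helper: decompress a .gz file chunk by chunk using base R only.
gunzip_file <- function(src, dest, chunk_size = 1024L * 1024L) {
  in_con <- gzfile(src, open = "rb")
  on.exit(close(in_con), add = TRUE)
  out_con <- file(dest, open = "wb")
  on.exit(close(out_con), add = TRUE)
  repeat {
    chunk <- readBin(in_con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break  # end of the compressed stream
    writeBin(chunk, out_con)
  }
  invisible(dest)
}

# Demo on a throwaway file: compress a short string, then decompress it again.
tmp_gz <- tempfile(fileext = ".gz")
gz_con <- gzfile(tmp_gz, open = "wb")
writeBin(charToRaw("wordnet"), gz_con)
close(gz_con)

tmp_out <- tempfile()
gunzip_file(tmp_gz, tmp_out)
restored <- rawToChar(readBin(tmp_out, what = "raw", n = 100L))
print(restored)

# For the real file, something along these lines should work (not run here):
# download.file("http://compling.hss.ntu.edu.sg/wnja/data/1.1/wnjpn.db.gz",
#               "wnjpn.db.gz", mode = "wb")
# gunzip_file("wnjpn.db.gz", "wnjpn.db")
```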
Using it from R
The DB file is handled with src_sqlite() from {dplyr}.
Connecting
wordnet-ja-conn.R
library(dplyr)
# Path to the decompressed SQLite3 database file
SET_WN_JPN <- list(
  FILE = "wnjpn.db"
)
wordnet_sqlite <- dplyr::src_sqlite(path = SET_WN_JPN$FILE, create = FALSE)
> methods::show(object = wordnet_sqlite)
src: sqlite 3.8.6 [wnjpn.db]
tbls: ancestor, link_def, pos_def, sense, synlink, synset, synset_def, synset_ex, variant, word,
xlink
Checking the definitions
wordnet-ja-def.R
# Part-of-speech definitions
dplyr::tbl(src = wordnet_sqlite, from = "pos_def") %>%
  dplyr::collect(x = .)
Source: local data frame [8 x 3]
pos lang def
1 a eng adjective
2 r eng adverb
3 n eng noun
4 v eng verb
5 a jpn 形容詞
6 r jpn 副詞
7 n jpn 名詞
8 v jpn 動詞
# Link definitions
## For details, see section "4 関連synsetとのリンク" (links to related synsets) at http://compling.hss.ntu.edu.sg/wnja/
dplyr::tbl(src = wordnet_sqlite, from = "link_def") %>%
  dplyr::collect(x = .) %>%
  as.data.frame()
link lang def
1 also eng See also
2 syns eng Synonyms
3 hype eng Hypernyms
4 inst eng Instances
5 hypo eng Hyponym
6 hasi eng Has Instance
7 mero eng Meronyms
8 mmem eng Meronyms --- Member
9 msub eng Meronyms --- Substance
10 mprt eng Meronyms --- Part
11 holo eng Holonyms
12 hmem eng Holonyms --- Member
13 hsub eng Holonyms --- Substance
14 hprt eng Holonyms -- Part
15 attr eng Attributes
16 sim eng Similar to
17 enta eng Entails
18 caus eng Causes
19 dmnc eng Domain --- Category
20 dmnu eng Domain --- Usage
21 dmnr eng Domain --- Region
22 dmtc eng In Domain --- Category
23 dmtu eng In Domain --- Usage
24 dmtr eng In Domain --- Region
25 ants eng Antonyms
Querying by word
wordnet-ja-search.R
# Define the search query
SET_SEARCH_QUERY <- list(
  WORD = c("bank"),
  POS = c("n")
)
# Extract the words matching the query from the word table
hit_words <- dplyr::tbl(
  src = wordnet_sqlite,
  from = dplyr::build_sql(
    "SELECT * FROM word WHERE lemma IN (", SET_SEARCH_QUERY$WORD, ") ",
    "AND pos IN (", SET_SEARCH_QUERY$POS, ")"
  )
) %>%
  dplyr::collect() %>%
  print
Source: local data frame [1 x 5]
wordid lang lemma pron pos
1 109396 eng bank NA n
# Look up the synsets that contain the matched word in the sense table
hit_words_synset <- dplyr::tbl(
  src = wordnet_sqlite,
  from = dplyr::build_sql(
    "SELECT * FROM sense WHERE wordid IN (", dplyr::ident(x = hit_words$wordid), ")"
  )
) %>%
  dplyr::collect() %>%
  print
Source: local data frame [10 x 7]
synset wordid lang rank lexid freq src
1 09213565-n 109396 eng 0 1 25 eng-30
2 08420278-n 109396 eng 0 2 20 eng-30
3 09213434-n 109396 eng 0 3 2 eng-30
4 08462066-n 109396 eng 0 4 1 eng-30
5 13368318-n 109396 eng 0 5 0 eng-30
6 13356402-n 109396 eng 0 6 0 eng-30
7 09213828-n 109396 eng 0 7 0 eng-30
8 04139859-n 109396 eng 0 8 0 eng-30
9 02787772-n 109396 eng 0 9 0 eng-30
10 00169305-n 109396 eng 0 10 0 eng-30
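The two queries above splice R values into an `IN (...)` list and write the parentheses inside the SQL string by hand. As a rough sketch, a small helper can render the whole list itself, quotes and parentheses included, so the result has the same shape for one value or many. `sql_in` is a hypothetical name for this illustration and is not part of dplyr; it only handles character and numeric vectors.

```r
# Hypothetical helper: render an R vector as a parenthesised SQL IN list.
sql_in <- function(values) {
  stopifnot(length(values) > 0L)
  if (is.character(values)) {
    # Standard SQL string literal: double any embedded single quote, then wrap in quotes.
    values <- paste0("'", gsub("'", "''", values, fixed = TRUE), "'")
  } else if (is.numeric(values)) {
    # Avoid scientific notation and padding for ids such as wordid.
    values <- format(values, scientific = FALSE, trim = TRUE)
  } else {
    stop("only character and numeric vectors are supported")
  }
  paste0("(", paste(values, collapse = ", "), ")")
}

print(sql_in("bank"))
print(sql_in(c(109396L, 181934L)))
print(paste("SELECT * FROM word WHERE lemma IN", sql_in(c("bank", "shore"))))
```

If the rendered text is handed to build_sql() as a variable, it would presumably need to be wrapped in dplyr::sql() first so that it is treated as SQL rather than escaped as a string value.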
# Using those synsets, fetch the synonyms from the word table
syn_word <- dplyr::left_join(
  x = dplyr::tbl(
    src = wordnet_sqlite,
    from = dplyr::build_sql(
      "SELECT synset, wordid, freq, src FROM sense WHERE synset IN ", hit_words_synset$synset
    )
  ),
  y = dplyr::tbl(src = wordnet_sqlite, from = "word"),
  by = c("wordid")
) %>%
  dplyr::select(synset, wordid, src, lang, lemma, pos) %>%
  dplyr::collect() %>%
  as.data.frame() %>%
  print
synset wordid src lang lemma pos
1 00169305-n 109396 eng-30 eng bank n
2 00169305-n 181934 hand jpn バンク n
3 02787772-n 2904 eng-30 eng bank_building n
4 02787772-n 109396 eng-30 eng bank n
5 02787772-n 181934 multi jpn バンク n
6 02787772-n 215551 hand jpn 銀行 n
7 04139859-n 33190 eng-30 eng money_box n
8 04139859-n 59504 eng-30 eng savings_bank n
9 04139859-n 109396 eng-30 eng bank n
10 04139859-n 114383 eng-30 eng coin_bank n
11 04139859-n 242717 mono jpn 貯蓄銀行 n
12 04139859-n 244846 hand jpn 貯金箱 n
13 08420278-n 50873 eng-30 eng banking_concern n
14 08420278-n 84396 eng-30 eng depository_financial_institution n
15 08420278-n 94765 eng-30 eng banking_company n
16 08420278-n 109396 eng-30 eng bank n
17 08420278-n 181934 hand jpn バンク n
18 08420278-n 215551 hand jpn 銀行 n
19 08420278-n 235459 hand jpn 銭荘 n
20 08420278-n 235696 hand jpn 預金金融機関 n
21 08462066-n 109396 eng-30 eng bank n
22 08462066-n 181934 hand jpn バンク n
23 09213434-n 109396 eng-30 eng bank n
24 09213434-n 160045 hand jpn 堆 n
25 09213434-n 178189 hand jpn 堤 n
26 09213434-n 181934 hand jpn バンク n
27 09213434-n 219256 hand jpn 土手 n
28 09213565-n 109396 eng-30 eng bank n
29 09213565-n 172009 hand jpn 岸べ n
30 09213565-n 204204 hand jpn 斜面 n
31 09213565-n 205653 hand jpn 岸辺 n
32 09213565-n 208732 hand jpn 岸 n
33 09213565-n 219256 hand jpn 土手 n
34 09213828-n 10271 eng-30 eng cant n
35 09213828-n 109396 eng-30 eng bank n
36 09213828-n 116039 eng-30 eng camber n
37 13356402-n 109396 eng-30 eng bank n
38 13356402-n 215551 hand jpn 銀行 n
39 13368318-n 109396 eng-30 eng bank n
40 13368318-n 215551 multi jpn 銀行 n
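The join above pairs every row of sense with its entry in word through wordid, so each synset ends up listing all of its member words. The same logic can be tried without the database on a toy subset; the sketch below uses base R `merge()` on a few English rows copied from the output above, and the object names (`sense_toy`, `word_toy`, `lemmas_by_synset`) are made up for this illustration.

```r
# Toy subset of the sense table (rows copied from the output above).
sense_toy <- data.frame(
  synset = c("02787772-n", "02787772-n", "09213828-n", "09213828-n", "09213828-n"),
  wordid = c(2904L, 109396L, 10271L, 109396L, 116039L),
  stringsAsFactors = FALSE
)
# Toy subset of the word table.
word_toy <- data.frame(
  wordid = c(2904L, 10271L, 109396L, 116039L),
  lemma  = c("bank_building", "cant", "bank", "camber"),
  stringsAsFactors = FALSE
)

# Left join on wordid, then order the rows the same way as the output above.
joined <- merge(sense_toy, word_toy, by = "wordid", all.x = TRUE)
joined <- joined[order(joined$synset, joined$wordid), ]

# Group the lemmas by synset to see each sense's synonym set.
lemmas_by_synset <- split(joined$lemma, joined$synset)
print(lemmas_by_synset)
```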
# Extract the related words
## hype: hypernym  hypo: hyponym  hmem: holonym (member)  mprt: meronym (part)
synlink_words <- dplyr::tbl(
  src = wordnet_sqlite,
  from = dplyr::build_sql(
    "SELECT word.lemma, link, word.pos, sense.wordid, synset1, synset2, word.lang, sense.src ",
    "FROM synlink, sense, word WHERE synset1 IN ", unique(x = syn_word$synset),
    " AND synset2 = synset AND sense.wordid = word.wordid"
  )
) %>%
  dplyr::collect() %>%
  as.data.frame() %>%
  print
lemma link pos wordid synset1 synset2 lang src
1 airplane_maneuver hype n 61332 00169305-n 00170844-n eng eng-30
2 flight_maneuver hype n 72532 00169305-n 00170844-n eng eng-30
3 vertical_bank hypo n 28729 00169305-n 00169522-n eng eng-30
4 depository hype n 20564 02787772-n 03177349-n eng eng-30
5 deposit hype n 25267 02787772-n 03177349-n eng eng-30
6 repository hype n 41348 02787772-n 03177349-n eng eng-30
7 depositary hype n 83776 02787772-n 03177349-n eng eng-30
8 預かり所 hype n 158245 02787772-n 03177349-n jpn hand
9 倉 hype n 182060 02787772-n 03177349-n jpn hand
10 保管所 hype n 185928 02787772-n 03177349-n jpn hand
11 貯蔵室 hype n 186985 02787772-n 03177349-n jpn hand
12 受託所 hype n 188582 02787772-n 03177349-n jpn hand
13 デポー hype n 209663 02787772-n 03177349-n jpn hand
14 蔵 hype n 214943 02787772-n 03177349-n jpn hand
15 貯蔵所 hype n 228040 02787772-n 03177349-n jpn hand
16 リポジトリ hype n 244611 02787772-n 03177349-n jpn hand
17 vault mprt n 47145 02787772-n 04523831-n eng eng-30
18 bank_vault mprt n 90999 02787772-n 04523831-n eng eng-30
19 container hype n 69347 04139859-n 03094503-n eng eng-30
20 器物 hype n 164432 04139859-n 03094503-n jpn hand
21 入れもの hype n 168151 04139859-n 03094503-n jpn hand
22 容れもの hype n 185050 04139859-n 03094503-n jpn hand
23 容れ物 hype n 187629 04139859-n 03094503-n jpn hand
24 コンテナー hype n 209830 04139859-n 03094503-n jpn hand
25 器 hype n 212300 04139859-n 03094503-n jpn hand
26 入物 hype n 222590 04139859-n 03094503-n jpn hand
27 コンテナ hype n 240512 04139859-n 03094503-n jpn hand
28 入れ物 hype n 243209 04139859-n 03094503-n jpn hand
29 容物 hype n 243436 04139859-n 03094503-n jpn hand
30 容器 hype n 247435 04139859-n 03094503-n jpn hand
31 penny_bank hypo n 13159 04139859-n 03935335-n eng eng-30
32 piggy_bank hypo n 95556 04139859-n 03935335-n eng eng-30
33 貯金 hypo n 207174 04139859-n 03935335-n jpn mono
34 banking_system hmem n 35951 08420278-n 08066491-n eng eng-30
35 banking_industry hmem n 91943 08420278-n 08066491-n eng eng-30
36 銀行システム hmem n 180060 08420278-n 08066491-n jpn mono
37 financial_organisation hype n 13889 08420278-n 08054721-n eng eng-30
38 financial_institution hype n 42159 08420278-n 08054721-n eng eng-30
39 financial_organization hype n 46425 08420278-n 08054721-n eng eng-30
40 金融機関 hype n 234854 08420278-n 08054721-n jpn hand
41 credit_union hypo n 89598 08420278-n 08234628-n eng eng-30
42 federal_reserve_bank hypo n 15844 08420278-n 08350919-n eng eng-30
43 reserve_bank hypo n 94149 08420278-n 08350919-n eng eng-30
44 準備銀行 hypo n 163248 08420278-n 08350919-n jpn hand
45 agent_bank hypo n 115323 08420278-n 08418316-n eng eng-30
46 commercial_bank hypo n 84972 08420278-n 08418420-n eng eng-30
47 full_service_bank hypo n 107735 08420278-n 08418420-n eng eng-30
48 商業銀行 hypo n 189796 08420278-n 08418420-n jpn hand
49 state_bank hypo n 6514 08420278-n 08418763-n eng eng-30
50 ゴスバンク hypo n 188702 08420278-n 08418763-n jpn mono
51 lead_bank hypo n 6763 08420278-n 08418885-n eng eng-30
52 agent_bank hypo n 115323 08420278-n 08418885-n eng eng-30
53 member_bank hypo n 113068 08420278-n 08419033-n eng eng-30
54 社員銀行 hypo n 175463 08420278-n 08419033-n jpn mono
55 merchant_bank hypo n 18400 08420278-n 08419163-n eng eng-30
56 acquirer hypo n 38281 08420278-n 08419163-n eng eng-30
57 マーチャントバンク hypo n 199652 08420278-n 08419163-n jpn hand
58 acquirer hypo n 38281 08420278-n 08419562-n eng eng-30
59 取得者 hypo n 181469 08420278-n 08419562-n jpn mono
60 thrift_institution hypo n 53053 08420278-n 08422524-n eng eng-30
61 home_loan_bank hypo n 43386 08420278-n 08423298-n eng eng-30
62 array hype n 34788 08462066-n 07939382-n eng eng-30
63 列 hype n 183528 08462066-n 07939382-n jpn hand
64 配列 hype n 213303 08462066-n 07939382-n jpn hand
65 ridge hype n 110340 09213434-n 09409512-n eng eng-30
66 隆起線 hype n 208638 09213434-n 09409512-n jpn hand
67 bluff hypo n 85554 09213434-n 09224725-n eng eng-30
68 崖 hypo n 200984 09213434-n 09224725-n jpn hand
69 岸 hypo n 208732 09213434-n 09224725-n jpn hand
70 断崖 hypo n 234504 09213434-n 09224725-n jpn hand
71 sandbank hypo n 31038 09213434-n 09421799-n eng eng-30
72 州 hypo n 178843 09213434-n 09421799-n jpn multi
73 砂州 hypo n 232284 09213434-n 09421799-n jpn hand
74 砂嘴 hypo n 232395 09213434-n 09421799-n jpn multi
75 incline hype n 12108 09213565-n 09437454-n eng eng-30
76 slope hype n 26474 09213565-n 09437454-n eng eng-30
77 side hype n 95288 09213565-n 09437454-n eng eng-30
78 傾斜 hype n 157253 09213565-n 09437454-n jpn hand
79 なぞえ hype n 158222 09213565-n 09437454-n jpn hand
80 傾斜面 hype n 173490 09213565-n 09437454-n jpn hand
81 勾配 hype n 176613 09213565-n 09437454-n jpn hand
82 スロープ hype n 187845 09213565-n 09437454-n jpn hand
83 坂 hype n 194276 09213565-n 09437454-n jpn hand
84 斜面 hype n 204204 09213565-n 09437454-n jpn hand
85 のり面 hype n 247005 09213565-n 09437454-n jpn hand
86 riverbank hypo n 29468 09213565-n 09415584-n eng eng-30
87 riverside hypo n 71456 09213565-n 09415584-n eng eng-30
88 川岸 hypo n 161829 09213565-n 09415584-n jpn hand
89 川端 hypo n 164896 09213565-n 09415584-n jpn hand
90 川堤 hypo n 177971 09213565-n 09415584-n jpn hand
91 川べり hypo n 188586 09213565-n 09415584-n jpn hand
92 川っ縁 hypo n 199838 09213565-n 09415584-n jpn hand
93 河岸 hypo n 200402 09213565-n 09415584-n jpn hand
94 川ぶち hypo n 202990 09213565-n 09415584-n jpn hand
95 川辺 hypo n 203067 09213565-n 09415584-n jpn hand
96 川縁 hypo n 203940 09213565-n 09415584-n jpn hand
97 川べ hypo n 207716 09213565-n 09415584-n jpn hand
98 川ばた hypo n 229181 09213565-n 09415584-n jpn hand
99 河堤 hypo n 231694 09213565-n 09415584-n jpn hand
100 川っぷち hypo n 240536 09213565-n 09415584-n jpn hand
101 河畔 hypo n 246327 09213565-n 09415584-n jpn hand
102 waterside hypo n 112977 09213565-n 09475925-n eng eng-30
103 水辺 hypo n 162248 09213565-n 09475925-n jpn hand
104 incline hype n 12108 09213828-n 09437454-n eng eng-30
105 slope hype n 26474 09213828-n 09437454-n eng eng-30
106 side hype n 95288 09213828-n 09437454-n eng eng-30
107 傾斜 hype n 157253 09213828-n 09437454-n jpn hand
108 なぞえ hype n 158222 09213828-n 09437454-n jpn hand
109 傾斜面 hype n 173490 09213828-n 09437454-n jpn hand
110 勾配 hype n 176613 09213828-n 09437454-n jpn hand
111 スロープ hype n 187845 09213828-n 09437454-n jpn hand
112 坂 hype n 194276 09213828-n 09437454-n jpn hand
113 斜面 hype n 204204 09213828-n 09437454-n jpn hand
114 のり面 hype n 247005 09213828-n 09437454-n jpn hand
115 pecuniary_resource hype n 3874 13356402-n 13356112-n eng eng-30
116 monetary_resource hype n 21832 13356402-n 13356112-n eng eng-30
117 finances hype n 51785 13356402-n 13356112-n eng eng-30
118 funds hype n 87569 13356402-n 13356112-n eng eng-30
119 cash_in_hand hype n 105562 13356402-n 13356112-n eng eng-30
120 資金 hype n 190819 13356402-n 13356112-n jpn hand
121 stockpile hype n 21992 13368318-n 13368052-n eng eng-30
122 reserve hype n 41486 13368318-n 13368052-n eng eng-30
123 backlog hype n 75723 13368318-n 13368052-n eng eng-30
124 リザーブ hype n 165130 13368318-n 13368052-n jpn hand
125 予備 hype n 181781 13368318-n 13368052-n jpn hand
126 蓄積 hype n 188202 13368318-n 13368052-n jpn hand
127 貯え hype n 213203 13368318-n 13368052-n jpn hand
128 控え hype n 228164 13368318-n 13368052-n jpn hand
129 蓄え hype n 230209 13368318-n 13368052-n jpn hand
130 貯蓄 hype n 233381 13368318-n 13368052-n jpn hand
131 備蓄 hype n 242874 13368318-n 13368052-n jpn hand
132 blood_bank hypo n 83103 13368318-n 13368517-n eng eng-30
133 血液銀行 hypo n 159726 13368318-n 13368517-n jpn hand
134 eye_bank hypo n 82686 13368318-n 13368675-n eng eng-30
135 food_bank hypo n 78295 13368318-n 13368900-n eng eng-30
136 soil_bank hypo n 82174 13368318-n 13369282-n eng eng-30
# bank in the sense of a shore
syn_word %>%
  dplyr::filter(synset == "09213565-n")
synset wordid src lang lemma pos
1 09213565-n 109396 eng-30 eng bank n
2 09213565-n 172009 hand jpn 岸べ n
3 09213565-n 204204 hand jpn 斜面 n
4 09213565-n 205653 hand jpn 岸辺 n
5 09213565-n 208732 hand jpn 岸 n
6 09213565-n 219256 hand jpn 土手 n
synlink_words %>%
  dplyr::filter(synset1 == "09213565-n")
lemma link pos wordid synset1 synset2 lang src
1 incline hype n 12108 09213565-n 09437454-n eng eng-30
2 slope hype n 26474 09213565-n 09437454-n eng eng-30
3 side hype n 95288 09213565-n 09437454-n eng eng-30
4 傾斜 hype n 157253 09213565-n 09437454-n jpn hand
5 なぞえ hype n 158222 09213565-n 09437454-n jpn hand
6 傾斜面 hype n 173490 09213565-n 09437454-n jpn hand
7 勾配 hype n 176613 09213565-n 09437454-n jpn hand
8 スロープ hype n 187845 09213565-n 09437454-n jpn hand
9 坂 hype n 194276 09213565-n 09437454-n jpn hand
10 斜面 hype n 204204 09213565-n 09437454-n jpn hand
11 のり面 hype n 247005 09213565-n 09437454-n jpn hand
12 riverbank hypo n 29468 09213565-n 09415584-n eng eng-30
13 riverside hypo n 71456 09213565-n 09415584-n eng eng-30
14 川岸 hypo n 161829 09213565-n 09415584-n jpn hand
15 川端 hypo n 164896 09213565-n 09415584-n jpn hand
16 川堤 hypo n 177971 09213565-n 09415584-n jpn hand
17 川べり hypo n 188586 09213565-n 09415584-n jpn hand
18 川っ縁 hypo n 199838 09213565-n 09415584-n jpn hand
19 河岸 hypo n 200402 09213565-n 09415584-n jpn hand
20 川ぶち hypo n 202990 09213565-n 09415584-n jpn hand
21 川辺 hypo n 203067 09213565-n 09415584-n jpn hand
22 川縁 hypo n 203940 09213565-n 09415584-n jpn hand
23 川べ hypo n 207716 09213565-n 09415584-n jpn hand
24 川ばた hypo n 229181 09213565-n 09415584-n jpn hand
25 河堤 hypo n 231694 09213565-n 09415584-n jpn hand
26 川っぷち hypo n 240536 09213565-n 09415584-n jpn hand
27 河畔 hypo n 246327 09213565-n 09415584-n jpn hand
28 waterside hypo n 112977 09213565-n 09475925-n eng eng-30
29 水辺 hypo n 162248 09213565-n 09475925-n jpn hand
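Once the related words for one sense are in a data frame, grouping them by link type makes the hypernyms and hyponyms easier to read. A minimal sketch with base R `split()`, using a handful of English rows copied from the output above; `related_toy` and `by_link` are names made up for this illustration.

```r
# Toy subset of synlink_words for synset1 == "09213565-n" (rows copied from the output above).
related_toy <- data.frame(
  lemma = c("incline", "slope", "side", "riverbank", "riverside", "waterside"),
  link  = c("hype", "hype", "hype", "hypo", "hypo", "hypo"),
  stringsAsFactors = FALSE
)

# One character vector of lemmas per link type.
by_link <- split(related_toy$lemma, related_toy$link)
print(by_link)
```

With the real data, the same call should apply to the filtered rows of synlink_words, since it only relies on the lemma and link columns.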
# Narrow down to bank in the sense of a financial institution
syn_word %>%
  dplyr::filter(synset == "08420278-n")
synset wordid src lang lemma pos
1 08420278-n 50873 eng-30 eng banking_concern n
2 08420278-n 84396 eng-30 eng depository_financial_institution n
3 08420278-n 94765 eng-30 eng banking_company n
4 08420278-n 109396 eng-30 eng bank n
5 08420278-n 181934 hand jpn バンク n
6 08420278-n 215551 hand jpn 銀行 n
7 08420278-n 235459 hand jpn 銭荘 n
8 08420278-n 235696 hand jpn 預金金融機関 n
synlink_words %>%
  dplyr::filter(synset1 == "08420278-n")
lemma link pos wordid synset1 synset2 lang src
1 banking_system hmem n 35951 08420278-n 08066491-n eng eng-30
2 banking_industry hmem n 91943 08420278-n 08066491-n eng eng-30
3 銀行システム hmem n 180060 08420278-n 08066491-n jpn mono
4 financial_organisation hype n 13889 08420278-n 08054721-n eng eng-30
5 financial_institution hype n 42159 08420278-n 08054721-n eng eng-30
6 financial_organization hype n 46425 08420278-n 08054721-n eng eng-30
7 金融機関 hype n 234854 08420278-n 08054721-n jpn hand
8 credit_union hypo n 89598 08420278-n 08234628-n eng eng-30
9 federal_reserve_bank hypo n 15844 08420278-n 08350919-n eng eng-30
10 reserve_bank hypo n 94149 08420278-n 08350919-n eng eng-30
11 準備銀行 hypo n 163248 08420278-n 08350919-n jpn hand
12 agent_bank hypo n 115323 08420278-n 08418316-n eng eng-30
13 commercial_bank hypo n 84972 08420278-n 08418420-n eng eng-30
14 full_service_bank hypo n 107735 08420278-n 08418420-n eng eng-30
15 商業銀行 hypo n 189796 08420278-n 08418420-n jpn hand
16 state_bank hypo n 6514 08420278-n 08418763-n eng eng-30
17 ゴスバンク hypo n 188702 08420278-n 08418763-n jpn mono
18 lead_bank hypo n 6763 08420278-n 08418885-n eng eng-30
19 agent_bank hypo n 115323 08420278-n 08418885-n eng eng-30
20 member_bank hypo n 113068 08420278-n 08419033-n eng eng-30
21 社員銀行 hypo n 175463 08420278-n 08419033-n jpn mono
22 merchant_bank hypo n 18400 08420278-n 08419163-n eng eng-30
23 acquirer hypo n 38281 08420278-n 08419163-n eng eng-30
24 マーチャントバンク hypo n 199652 08420278-n 08419163-n jpn hand
25 acquirer hypo n 38281 08420278-n 08419562-n eng eng-30
26 取得者 hypo n 181469 08420278-n 08419562-n jpn mono
27 thrift_institution hypo n 53053 08420278-n 08422524-n eng eng-30
28 home_loan_bank hypo n 43386 08420278-n 08423298-n eng eng-30
Writing these queries by hand every time is tedious, so I plan to wrap them in functions eventually.
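As one possible shape for such a function, here is a minimal sketch that looks up every word sharing a synset with a given lemma in a single parameterised query. It is not the code used above: it assumes the DBI and RSQLite packages are installed, talks to the database through DBI directly instead of src_sqlite(), and `lookup_synonyms` is a hypothetical name. The demo runs against a toy in-memory database holding a few English rows copied from the output above.

```r
# Hypothetical helper: all words that share a synset with (lemma, pos).
lookup_synonyms <- function(con, lemma, pos) {
  DBI::dbGetQuery(
    con,
    "SELECT s2.synset, w2.wordid, w2.lang, w2.lemma, w2.pos
       FROM word  AS w1
       JOIN sense AS s1 ON s1.wordid = w1.wordid
       JOIN sense AS s2 ON s2.synset = s1.synset
       JOIN word  AS w2 ON w2.wordid = s2.wordid
      WHERE w1.lemma = ? AND w1.pos = ?
      ORDER BY s2.synset, w2.wordid",
    params = list(lemma, pos)  # bound as values, so no manual quoting is needed
  )
}

# Toy in-memory database; only the columns used by the query are created.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "word", data.frame(
  wordid = c(2904L, 10271L, 109396L, 116039L),
  lang   = "eng",
  lemma  = c("bank_building", "cant", "bank", "camber"),
  pos    = "n",
  stringsAsFactors = FALSE
))
DBI::dbWriteTable(con, "sense", data.frame(
  synset = c("02787772-n", "02787772-n", "09213828-n", "09213828-n", "09213828-n"),
  wordid = c(2904L, 109396L, 10271L, 109396L, 116039L),
  stringsAsFactors = FALSE
))

synonyms <- lookup_synonyms(con, lemma = "bank", pos = "n")
print(synonyms)
DBI::dbDisconnect(con)
```

Against the real file, the connection would presumably be opened with `DBI::dbConnect(RSQLite::SQLite(), "wnjpn.db")` and the function called the same way.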