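Introduction
I wanted to implement NLP-style similarity search in Node-RED as well, so this post writes up what I tried around the end of last year (a winter-break hobby project).
Since around Elasticsearch 7.3, searches over the dense vector type (`dense_vector`) have been supported. Being able to handle vectors in Elasticsearch means you are no longer limited to one-to-one comparisons: you can search.
That said, Elasticsearch does not compute the vectors itself, so the vectors are computed, with some effort, in Node-RED.
Personally I don't care much about performance or accuracy here; I just wanted to do something resembling machine-learning-based NLP in Node-RED on a Raspberry Pi without relying on cloud services.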
Environment
- Raspberry Pi 4B (RAM: 4GB)
- Raspbian (Buster)
- Node-RED v1.0.3 (node v12.14.0 / npm 6.13.4)
- Elasticsearch 7.9.0 + kuromoji plugin (already set up on my home Kubernetes cluster)
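Node-RED setup
Install the word2vec module and configure it so that it can be used from Node-RED.
Installing word2vec
Move to the ~/.node-red directory of the user running Node-RED and install the module with the npm command.
The install completes normally, but the bundled binaries are in fact not built for ARM.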
root@chino:~# cd .node-red/
root@chino:~/.node-red# npm -v
6.13.4
root@chino:~/.node-red# npm install word2vec
> word2vec@1.1.4 postinstall /root/.node-red/node_modules/word2vec
> make --directory=src
make: ディレクトリ '/root/.node-red/node_modules/word2vec/src' に入ります
make: 'all' に対して行うべき事はありません.
make: ディレクトリ '/root/.node-red/node_modules/word2vec/src' から出ます
npm WARN node-red-project@0.0.1 No repository field.
npm WARN node-red-project@0.0.1 No license field.
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: xpc-connection@0.1.4 (node_modules/xpc-connection):
npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for xpc-connection@0.1.4: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"arm"})
+ word2vec@1.1.4
added 23 packages from 5 contributors and audited 343 packages in 7.708s
6 packages are looking for funding
run `npm fund` for details
found 7 vulnerabilities (6 low, 1 critical)
run `npm audit fix` to fix them, or `npm audit` for details
The bundled binaries appear to be for x86_64, so the module cannot be used right after installation.
The file command confirms that word2vec is an x86-64 ELF binary.
root@chino:~/.node-red# cd node_modules/word2vec/src
root@chino:~/.node-red/node_modules/word2vec/src# ls -al
合計 268
drwxr-xr-x 2 root root 4096 8月 8 22:14 .
drwxr-xr-x 5 root root 4096 8月 8 22:14 ..
-rw-r--r-- 1 root root 11358 10月 26 1985 LICENSE
-rw-r--r-- 1 root root 1209 10月 26 1985 README.txt
-rwxr-xr-x 1 root root 17328 10月 26 1985 compute-accuracy
-rw-r--r-- 1 root root 5221 10月 26 1985 compute-accuracy.c
-rwxr-xr-x 1 root root 631 10月 26 1985 demo-analogy.sh
-rwxr-xr-x 1 root root 358 10月 26 1985 demo-classes.sh
-rwxr-xr-x 1 root root 885 10月 26 1985 demo-phrase-accuracy.sh
-rwxr-xr-x 1 root root 853 10月 26 1985 demo-phrases.sh
-rwxr-xr-x 1 root root 5126 10月 26 1985 demo-train-big-model-v1.sh
-rwxr-xr-x 1 root root 414 10月 26 1985 demo-word-accuracy.sh
-rwxr-xr-x 1 root root 272 10月 26 1985 demo-word.sh
-rwxr-xr-x 1 root root 21312 10月 26 1985 distance
-rw-r--r-- 1 root root 4557 10月 26 1985 distance.c
-rw-r--r-- 1 root root 741 10月 26 1985 makefile
-rwxr-xr-x 1 root root 21272 10月 26 1985 word-analogy
-rw-r--r-- 1 root root 4664 10月 26 1985 word-analogy.c
-rwxr-xr-x 1 root root 22520 10月 26 1985 word2phrase
-rw-r--r-- 1 root root 9387 10月 26 1985 word2phrase.c
-rwxr-xr-x 1 root root 52688 10月 26 1985 word2vec
-rw-r--r-- 1 root root 26195 10月 26 1985 word2vec.c
root@chino:~/.node-red/node_modules/word2vec/src# file word2vec
word2vec: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=c615749e7ef3d22ea0cd261a42e2e17551a925cf, not stripped
Since a makefile was included I thought it might just work, and indeed, recompiling made the binaries usable.
root@chino:~/.node-red/node_modules/word2vec/src# make clean
rm -rf word2vec word2phrase distance word-analogy compute-accuracy
root@chino:~/.node-red/node_modules/word2vec/src# make
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word2vec.c: In function ‘TrainModelThread’:
word2vec.c:366:36: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
unsigned long long next_random = (long long)id;
^
word2vec.c:372:50: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);
^
word2vec.c:413:54: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);
^
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
char ch;
^~
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
char ch;
^~
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
^~
chmod +x *.sh
I was prepared to patch the source if there were compile errors, but only warnings came out.
Adding word2vec to settings.js
Add a word2vec entry to `functionGlobalContext` in the Node-RED configuration file settings.js, then restart Node-RED.
root@chino:~/.node-red/node_modules/word2vec/src# cd -
/root/.node-red
root@chino:~/.node-red# vi settings.js
: (略)
functionGlobalContext: {
os: require('os'),
mdns: require('multicast-dns'),
gnuplot: require('gnuplot'),
word2vec: require('word2vec')
},
: (略)
root@chino:~/.node-red# systemctl restart nodered
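With that entry in place, a function node can reach the module through the global context. The following is a minimal sketch, assuming the key name `word2vec` from the settings above. `getWord2vec` is a hypothetical helper name, and the context object is passed in as a parameter so the snippet can also be run outside Node-RED.

```javascript
// Hypothetical helper: fetch the word2vec module registered in
// functionGlobalContext. Inside a function node, Node-RED provides `global`;
// here it is a parameter so the sketch can be exercised standalone.
function getWord2vec(globalContext) {
    const w2v = globalContext.get('word2vec');
    if (!w2v) {
        // Typical causes: settings.js was not edited, or Node-RED was not restarted.
        throw new Error('word2vec is not registered in functionGlobalContext');
    }
    return w2v;
}
```

Inside a function node this would be called as `const w2v = getWord2vec(global);`.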
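Data initialization
Create the Elasticsearch index, define its settings/mappings, and initialize the corpus file.
The exported flow JSON is below. Note: adjust the Elasticsearch IP address, authentication, and so on to your own environment.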
[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"81ed13ed.f867b","type":"template","z":"22758675.f9ceca","name":"ES settings/mappings","field":"payload","fieldType":"msg","format":"json","syntax":"plain","template":"{\n \"settings\": {\n \"index\": {\n \"analysis\": {\n \"tokenizer\": {\n \"custom_tokenizer\": {\n \"type\": \"kuromoji_tokenizer\",\n \"mode\": \"search\",\n \"discard_punctuation\": \"true\",\n \"user_dictionary\": \"/opt/elasticsearch/data/userdict.txt\"\n }\n },\n \"filter\": {\n \"lowercase\": {\n \"type\": \"lowercase\",\n \"language\": \"greek\"\n },\n \"length\": {\n \"type\": \"length\",\n \"min\": \"2\"\n },\n \"stop\": {\n \"type\": \"stop\",\n \"stopwords\": [\"undefined\"]\n },\n \"pos\": {\n \"type\": \"kuromoji_part_of_speech\",\n \"stoptags\": [\n \"名詞-数\",\n \"その他-間投\",\n \"フィラー\",\n \"感動詞\",\n \"記号-一般\",\n \"記号-括弧開\",\n \"記号-括弧閉\",\n \"記号-句点\",\n \"記号-空白\",\n \"記号-読点\",\n \"記号\",\n \"形容詞-自立\",\n \"形容詞-接尾\",\n \"形容詞-非自立\",\n \"形容詞\",\n \"語断片\",\n \"助詞-格助詞-一般\",\n 
\"助詞-格助詞-引用\",\n \"助詞-格助詞-連語\",\n \"助詞-格助詞\",\n \"助詞-間投助詞\",\n \"助詞-係助詞\",\n \"助詞-終助詞\",\n \"助詞-接続助詞\",\n \"助詞-特殊\",\n \"助詞-副詞化\",\n \"助詞-副助詞\",\n \"助詞-副助詞/並立助詞/終助詞\",\n \"助詞-並立助詞\",\n \"助詞-連体化\",\n \"助詞\",\n \"助動詞\",\n \"接続詞\",\n \"接頭詞-形容詞接続\",\n \"接頭詞-数接続\",\n \"接頭詞-動詞接続\",\n \"接頭詞-名詞接続\",\n \"接頭詞\",\n \"動詞-自立\",\n \"動詞-接尾\",\n \"動詞-非自立\",\n \"動詞\",\n \"非言語音\",\n \"副詞-一般\",\n \"副詞-助詞類接続\",\n \"副詞\",\n \"連体詞\"\n ]\n }\n },\n \"analyzer\": {\n \"custom_analyzer\": {\n \"filter\": [\n \"kuromoji_baseform\",\n \"kuromoji_stemmer\",\n \"cjk_width\",\n \"ja_stop\",\n \"lowercase\",\n \"length\",\n \"stop\",\n \"pos\"\n ],\n \"type\": \"custom\",\n \"tokenizer\": \"custom_tokenizer\"\n }\n }\n }\n }\n },\n \"mappings\": {\n \"properties\": {\n \"title\": {\n \"type\": \"text\",\n \"fields\": {\n \"keyword\": {\n \"type\": \"keyword\",\n \"ignore_above\": 1024\n },\n \"token\": {\n \"type\": \"text\",\n \"analyzer\": \"custom_analyzer\",\n \"fielddata\": true\n }\n }\n },\n \"url\": {\n \"type\": \"text\"\n },\n \"vector\": {\n \"type\": \"dense_vector\",\n \"dims\": 300\n }\n }\n }\n}","output":"str","x":900,"y":80,"wires":[["6472220c.3ce12c"]]},{"id":"1774e1ed.32b88e","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":1270,"y":80,"wires":[]},{"id":"a51fdfbb.1e407","type":"inject","z":"22758675.f9ceca","name":"ES INDEX 初期化","repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":130,"y":80,"wires":[["3c7f6014.b2a8e"]]},{"id":"3c7f6014.b2a8e","type":"change","z":"22758675.f9ceca","name":"DELETE 
newsrss","rules":[{"t":"set","p":"method","pt":"msg","to":"DELETE","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":340,"y":80,"wires":[["ca8fdd88.d2faf"]]},{"id":"ca8fdd88.d2faf","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":530,"y":80,"wires":[["3176165d.d4098a"]]},{"id":"3176165d.d4098a","type":"change","z":"22758675.f9ceca","name":"PUT newsrss","rules":[{"t":"set","p":"method","pt":"msg","to":"PUT","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":700,"y":80,"wires":[["81ed13ed.f867b"]]},{"id":"6472220c.3ce12c","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1100,"y":80,"wires":[["1774e1ed.32b88e"]]},{"id":"bfc3b5a.1600348","type":"comment","z":"22758675.f9ceca","name":"Elasticsearch インデックス初期化","info":"","x":160,"y":30,"wires":[]},{"id":"101ae4a6.4d2eeb","type":"comment","z":"22758675.f9ceca","name":"コーパス削除","info":"","x":90,"y":140,"wires":[]},{"id":"dfdc80ac.73867","type":"file","z":"22758675.f9ceca","name":"","filename":"/tmp/corpus.txt","appendNewline":true,"createDir":true,"overwriteFile":"delete","encoding":"none","x":340,"y":190,"wires":[["2d4006eb.3e066a"]]},{"id":"c8cd3c4e.d1d4e","type":"inject","z":"22758675.f9ceca","name":"コーパス削除","repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":120,"y":190,"wires":[["dfdc80ac.73867"]]},{"id":"2d4006eb.3e066a","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":540,"y":190,"wires":[]}]
The flow defines a custom analyzer, but it is only there for Kibana tag clouds, so it can be omitted. There is no harm in leaving it in, though.
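Stripped of the analyzer settings, the part of the index definition that matters for vector search is small. The following is a sketch of that request body built in JavaScript, assuming the field names `title`, `url`, and `vector` and the index name `newsrss` from the flow above. `buildIndexBody` is a hypothetical helper, not part of the flow.

```javascript
// Hypothetical helper: minimal index body for vector search.
function buildIndexBody(dims) {
    return {
        mappings: {
            properties: {
                title: { type: 'text' },
                url: { type: 'text' },
                // dims has to match the `size` option used when training word2vec
                vector: { type: 'dense_vector', dims: dims }
            }
        }
    };
}

// Request body for: PUT /newsrss
const body = JSON.stringify(buildIndexBody(300));
```

The one thing to keep in sync is `dims`: a document whose vector has a different number of elements than the mapping declares will be rejected at index time.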
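Generating vector data in Node-RED
I am mostly self-taught and doing this out of curiosity, so I'm not sure how well I can explain it.
Word vectors are obtained by extracting words from a large body of sample text (the corpus), tokenizing it, and having word2vec analyze it sentence by sentence.
What we want is something like how words are used (how they are ordered), so intuitively, sentences that explain something are probably the best material.
Here I use the actual search-target sentences as the corpus and build a word-vector model from them. (The sentences are short and few, so I don't expect the accuracy to be very good.)
First, fetch the corpus and load it into Elasticsearch.
I picked up just the titles of 100 news articles from Yahoo! RSS feeds.
The exported flow JSON is below. Note: adjust the Elasticsearch IP address, authentication, and so on to your own environment.
Note (2020/09/25): fixed the URLs and the regular expression that strips parentheses.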
[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"2e78b7dc.363348","type":"xml","z":"22758675.f9ceca","name":"","property":"payload","attr":"","chr":"","x":490,"y":390,"wires":[["28f083d1.832a6c"]]},{"id":"1917fb84.d25874","type":"http 
request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/cnetj/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":310,"wires":[["2e78b7dc.363348"]]},{"id":"6a5ce588.e2abac","type":"inject","z":"22758675.f9ceca","name":"RSS1","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":310,"wires":[["1917fb84.d25874"]]},{"id":"28f083d1.832a6c","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"move","p":"payload.rss.channel.0.item","pt":"msg","to":"item","tot":"msg"},{"t":"set","p":"i","pt":"msg","to":"0","tot":"num"}],"action":"","property":"","from":"","to":"","reg":false,"x":650,"y":390,"wires":[["9efe39a0.332728"]]},{"id":"9efe39a0.332728","type":"switch","z":"22758675.f9ceca","name":"","property":"i","propertyType":"msg","rules":[{"t":"lt","v":"item.length","vt":"msg"},{"t":"else"}],"checkall":"false","repair":false,"outputs":2,"x":860,"y":450,"wires":[["961ec72b.6eeb88"],["78446511.c78bbc"]]},{"id":"78446511.c78bbc","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"i","targetType":"msg","x":1040,"y":470,"wires":[]},{"id":"e23b58d.d99f4a8","type":"http 
request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/yonnana/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":350,"wires":[["2e78b7dc.363348"]]},{"id":"42cca356.9bee7c","type":"inject","z":"22758675.f9ceca","name":"RSS2","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":350,"wires":[["e23b58d.d99f4a8"]]},{"id":"51d0e9e3.c0e6f8","type":"inject","z":"22758675.f9ceca","name":"RSS3","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":390,"wires":[["b9ae1355.aaf2e"]]},{"id":"b9ae1355.aaf2e","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/cnn/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":390,"wires":[["2e78b7dc.363348"]]},{"id":"a0032e91.5b3d3","type":"inject","z":"22758675.f9ceca","name":"RSS4","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":430,"wires":[["2af9ea67.7ebcc6"]]},{"id":"2af9ea67.7ebcc6","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/impress/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":430,"wires":[["2e78b7dc.363348"]]},{"id":"6cc867cc.f90428","type":"inject","z":"22758675.f9ceca","name":"RSS5","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":470,"wires":[["c23ba14.15f666"]]},{"id":"c23ba14.15f666","type":"http 
request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/zdn_n/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":470,"wires":[["2e78b7dc.363348"]]},{"id":"a62fa79.c6bd858","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n \"analyzer\": \"kuromoji\",\n \"text\": \"{{{text}}}\"\n}","output":"str","x":1410,"y":300,"wires":[["8f8b5c3d.d0fd8"]]},{"id":"8f8b5c3d.d0fd8","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1570,"y":300,"wires":[["42f47f8e.f841b"]]},{"id":"42f47f8e.f841b","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"move","p":"payload.tokens","pt":"msg","to":"tokens","tot":"msg"},{"t":"set","p":"j","pt":"msg","to":"0","tot":"num"},{"t":"set","p":"corpus","pt":"msg","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":1160,"y":350,"wires":[["84129293.579c8"]]},{"id":"84129293.579c8","type":"switch","z":"22758675.f9ceca","name":"","property":"j","propertyType":"msg","rules":[{"t":"lt","v":"tokens.length","vt":"msg"},{"t":"else"}],"checkall":"false","repair":false,"outputs":2,"x":1360,"y":390,"wires":[["9663168f.7f7c98"],["208b0a0c.704196"]]},{"id":"55fb0326.e51fcc","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"j","pt":"msg","to":"$number(j+1)\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":1710,"y":350,"wires":[["84129293.579c8"]]},{"id":"9663168f.7f7c98","type":"function","z":"22758675.f9ceca","name":"corpus += token","func":"\nmsg.corpus += msg.tokens[msg.j].token + ' ';\n\nreturn 
msg;","outputs":1,"noerr":0,"x":1540,"y":350,"wires":[["55fb0326.e51fcc"]]},{"id":"c3dca400.c5c3d8","type":"file","z":"22758675.f9ceca","name":"","filename":"/tmp/corpus.txt","appendNewline":true,"createDir":true,"overwriteFile":"false","encoding":"none","x":1730,"y":400,"wires":[["759510d7.c5ef5"]]},{"id":"208b0a0c.704196","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"corpus","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1540,"y":400,"wires":[["c3dca400.c5c3d8"]]},{"id":"751de0a5.241f8","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"i","pt":"msg","to":"$number(i+1)\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":2390,"y":400,"wires":[["9efe39a0.332728"]]},{"id":"961ec72b.6eeb88","type":"function","z":"22758675.f9ceca","name":"title,link退避","func":"\nmsg.text = msg.item[msg.i].title[0];\nmsg.link = msg.item[msg.i].link[0];\n\nreturn msg;","outputs":1,"noerr":0,"x":1060,"y":300,"wires":[["4b745250.3e1d5c"]]},{"id":"4b745250.3e1d5c","type":"change","z":"22758675.f9ceca","name":"POST _analyze","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/_analyze","tot":"str"},{"t":"change","p":"text","pt":"msg","from":"(.*?)","fromt":"re","to":"","tot":"str"},{"t":"change","p":"text","pt":"msg","from":"\\(.*?\\)","fromt":"re","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":1240,"y":300,"wires":[["a62fa79.c6bd858"]]},{"id":"c5e36daa.90d1b","type":"comment","z":"22758675.f9ceca","name":"コーパス作成 & Elasticsearch データ投入","info":"","x":180,"y":260,"wires":[]},{"id":"445e8ca6.8d4674","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","x":2230,"y":400,"wires":[["751de0a5.241f8"]]},{"id":"5a565150.dff06","type":"template","z":"22758675.f9ceca","name":"ES 
query","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n \"date\": \"{{{date}}}\",\n \"title\": \"{{{text}}}\",\n \"url\": \"{{{link}}}\"\n}","output":"str","x":2070,"y":400,"wires":[["445e8ca6.8d4674"]]},{"id":"759510d7.c5ef5","type":"change","z":"22758675.f9ceca","name":"POST _doc","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_doc","tot":"str"},{"t":"set","p":"date","pt":"msg","to":"$now()\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":1910,"y":400,"wires":[["5a565150.dff06"]]}]
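Word segmentation uses the "kuromoji" analyzer through Elasticsearch's _analyze API. Because of this, this part of the processing is quite slow.
Still, this approach is reasonable if you want to use the same dictionary as at search time. (This time I use the default analyzer, but if you define a custom analyzer for segmentation in the index settings and give it the same user dictionary and synonym dictionary as the one used for visualization, you should be happy.)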
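The token loop in the flow boils down to joining the `token` values of the _analyze response with spaces, one line per title. A compact sketch of that step follows; `toCorpusLine` is a hypothetical helper, and the sample tokens are made up for illustration.

```javascript
// Hypothetical helper: turn an _analyze response into one corpus line.
// The response is expected to look like { tokens: [{ token: '...' }, ...] }.
function toCorpusLine(analyzeResponse) {
    const tokens = (analyzeResponse && analyzeResponse.tokens) || [];
    return tokens.map((t) => t.token).join(' ');
}

// Illustrative response; real responses also carry offsets and token types.
const sample = { tokens: [{ token: '東京', position: 0 }, { token: '天気', position: 1 }] };
console.log(toCorpusLine(sample)); // 東京 天気
```

Unlike the function node in the flow, this version does not leave a trailing space at the end of the line; for word2vec's whitespace-separated input either form should be fine.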
The exported flow JSON for building the word-vector model is below.
[{"id":"50103ea1.5c42a","type":"function","z":"22758675.f9ceca","name":"w2v.word2vec","func":"\n// settings.jsのfunctionGlobalContextで「word2vec: require('word2vec')」と定義しておく。\nvar w2v = new global.get('word2vec');\n\n// binary: 1にしないとloadModelできない。size(次元数),window,iterはお好みで。\nw2v.word2vec('/tmp/corpus.txt','/tmp/model.bin',\n{size: 300,window: 3,min_count: 1,iter: 100,binary: 1},\nfunction(){\n msg.payload = true;\n node.send(msg);\n});\n\nreturn msg;","outputs":1,"noerr":0,"x":380,"y":600,"wires":[["67b9a2fe.9de0ac"]]},{"id":"95a550bc.fe089","type":"inject","z":"22758675.f9ceca","name":"単語ベクトルモデルの作成","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":160,"y":600,"wires":[["50103ea1.5c42a"]]},{"id":"c07ee446.494448","type":"comment","z":"22758675.f9ceca","name":"単語ベクトルモデル作成","info":"","x":130,"y":550,"wires":[]},{"id":"67b9a2fe.9de0ac","type":"switch","z":"22758675.f9ceca","name":"","property":"payload","propertyType":"msg","rules":[{"t":"true"}],"checkall":"false","repair":false,"outputs":1,"x":540,"y":600,"wires":[["c145fa09.be7608"]]},{"id":"c145fa09.be7608","type":"function","z":"22758675.f9ceca","name":"w2v.loadModel","func":"\nvar w2v = new global.get('word2vec');\n\nw2v.loadModel('/tmp/model.bin',\nfunction(error,model){\n msg.model = model;\n node.send(msg);\n});\n\nreturn msg;","outputs":1,"noerr":0,"x":700,"y":600,"wires":[["ef6550ca.840be"]]},{"id":"38bf8a61.931626","type":"switch","z":"22758675.f9ceca","name":"","property":"payload","propertyType":"msg","rules":[{"t":"true"}],"checkall":"false","repair":false,"outputs":1,"x":1050,"y":600,"wires":[["aec1dac0.14ab78"]]},{"id":"aec1dac0.14ab78","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"model","targetType":"msg","x":1200,"y":600,"wires":[]},{"id":"ef6550ca.840be","type":"function","z":"22758675.f9ceca","name":"msg.model確認","func":"\nif(typeof msg.model !== 'undefined'){\n 
msg.payload = true;\n} else {\n msg.payload = null;\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":890,"y":600,"wires":[["38bf8a61.931626"]]}]
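The first function node generates the model from the corpus. Tune the word2vec options separately to your liking.
This time they are tuned (min_count: 1) on the assumption of short sentences and a small amount of data.
The callback sets msg.payload to true so that the downstream nodes run once model generation has finished.
The second function node loads the model.
Its callback copies the model object into msg.model and passes it on to the downstream nodes.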
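The two function nodes can also be read as one asynchronous sequence: train, then load, then report once. The sketch below is one possible tidy-up, not the flow's actual code. It assumes the same `word2vec(input, output, params, callback)` and `loadModel(file, callback)` calls used in the flow; `trainAndLoad` and `done` are hypothetical names, and the module is passed in as a parameter.

```javascript
// Sketch: train a model, then load it, and report the outcome exactly once.
// `w2v` is the module obtained via global.get('word2vec'); the parameter
// values mirror the ones used in the flow above.
function trainAndLoad(w2v, corpusPath, modelPath, done) {
    // binary: 1 is required so that loadModel can read the output file
    const params = { size: 300, window: 3, min_count: 1, iter: 100, binary: 1 };
    w2v.word2vec(corpusPath, modelPath, params, function () {
        // Training has finished; now read the model back in.
        w2v.loadModel(modelPath, function (error, model) {
            if (error) {
                done(error);
                return;
            }
            done(null, model);
        });
    });
}
```

In a function node this could be used as `trainAndLoad(global.get('word2vec'), '/tmp/corpus.txt', '/tmp/model.bin', (err, model) => { if (err) { node.error(err, msg); return; } msg.model = model; node.send(msg); });` with no `return msg;` at the end. The message is then sent only from the callback, so the downstream check for whether the model is present becomes optional.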
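Perceptive readers will have noticed by now that Node-RED (probably) cannot load very large models. In other words, unless you train with a restricted vocabulary, loading the model will run out of memory.
When you run the flow and the model is created, the debug window shows the number of words and the number of dimensions.
Once model generation and loading are confirmed, compute the sentence vectors and update the data in Elasticsearch.
The exported flow JSON is below. Note: adjust the Elasticsearch IP address, authentication, and so on to your own environment.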
[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":false,"url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"3972c355.e855cc","type":"inject","z":"22758675.f9ceca","name":"文ベクトル埋め込み","topic":"","payload":"文ベクトル埋め込み","payloadType":"str","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":140,"y":720,"wires":[["2ff2a0fe.054bf"]]},{"id":"2ff2a0fe.054bf","type":"function","z":"22758675.f9ceca","name":"w2v.loadModel","func":"\nvar w2v = new global.get('word2vec');\n\nw2v.loadModel('/tmp/model.bin',\nfunction(error,model){\n msg.model = model;\n node.send(msg);\n});\n\nreturn msg;","outputs":1,"noerr":0,"x":340,"y":720,"wires":[["e4a708de.e21c68"]]},{"id":"e4a708de.e21c68","type":"function","z":"22758675.f9ceca","name":"msg.model確認","func":"\nif(typeof msg.model !== 'undefined'){\n msg.payload = true;\n} else {\n msg.payload = null;\n}\n\nreturn 
msg;","outputs":1,"noerr":0,"x":530,"y":720,"wires":[["ce28b330.4c4b6"]]},{"id":"ce28b330.4c4b6","type":"switch","z":"22758675.f9ceca","name":"","property":"payload","propertyType":"msg","rules":[{"t":"true"}],"checkall":"false","repair":false,"outputs":1,"x":690,"y":720,"wires":[["bdc08afc.c94948"]]},{"id":"998a3a4b.5ba7e8","type":"function","z":"22758675.f9ceca","name":"w2v.getVectors","func":"\nmsg.array = [];\n\nif(typeof msg.tokens !== 'undefined'){\n for(var i=0;i<msg.tokens.length;i++){\n msg.array.push(msg.tokens[i].token);\n }\n}\n\nmsg.payload = msg.model.getVectors(msg.array);\n\nmsg.swem = [];\n\nreturn msg;","outputs":1,"noerr":0,"x":890,"y":840,"wires":[["47c71198.d290a"]]},{"id":"47c71198.d290a","type":"function","z":"22758675.f9ceca","name":"SWEM(max)もどき","func":"\nvar keys;\nvar d;\n\nfor(var i=0;i< msg.payload.length;i++){\n keys = Object.keys(msg.payload[i].values);\n\n for(d=0;d<keys.length;d++){ //d=次元数\n if(i === 0){\n msg.swem[d] = msg.payload[i].values[keys[d]];\n } else {\n msg.swem[d] = (msg.payload[i].values[keys[d]] > msg.swem[d])? 
msg.payload[i].values[keys[d]]:msg.swem[d];\n }\n }\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1090,"y":840,"wires":[["ba6d8af8.1e84e8"]]},{"id":"f1b3342.5bfdec8","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n \"query\": {\n \"match_all\": {}\n },\n \"sort\": {\n \"date\": {\n \"order\": \"desc\"\n }\n },\n \"size\": 10000,\n \"_source\": [\"title\"]\n}","output":"str","x":1080,"y":720,"wires":[["25c663d4.cf5fec"]]},{"id":"bdc08afc.c94948","type":"change","z":"22758675.f9ceca","name":"POST newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":880,"y":720,"wires":[["f1b3342.5bfdec8"]]},{"id":"25c663d4.cf5fec","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1240,"y":720,"wires":[["ae3babb0.5ca078"]]},{"id":"431cc7fc.05acc8","type":"change","z":"22758675.f9ceca","name":"tokens退避","rules":[{"t":"move","p":"payload.tokens","pt":"msg","to":"tokens","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1490,"y":780,"wires":[["998a3a4b.5ba7e8"]]},{"id":"ae3babb0.5ca078","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"move","p":"payload.hits.hits","pt":"msg","to":"hits","tot":"msg"},{"t":"set","p":"i","pt":"msg","to":"0","tot":"num"}],"action":"","property":"","from":"","to":"","reg":false,"x":380,"y":780,"wires":[["affeb34f.a1dad"]]},{"id":"7007a3b.cbd015c","type":"function","z":"22758675.f9ceca","name":"id,title退避","func":"\nmsg.id = msg.hits[msg.i]._id;\nmsg.title = msg.hits[msg.i]._source.title;\n\nreturn 
msg;","outputs":1,"noerr":0,"x":810,"y":780,"wires":[["1c258ca3.7db1c3"]]},{"id":"affeb34f.a1dad","type":"switch","z":"22758675.f9ceca","name":"","property":"i","propertyType":"msg","rules":[{"t":"lt","v":"hits.length","vt":"msg"},{"t":"else"}],"checkall":"false","repair":false,"outputs":2,"x":600,"y":900,"wires":[["7007a3b.cbd015c"],["da98d14c.f3f2d"]]},{"id":"9bf2452f.f8a6c8","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n \"analyzer\": \"kuromoji\",\n \"text\": \"{{{title}}}\"\n}","output":"str","x":1160,"y":780,"wires":[["d69b04a6.f07038"]]},{"id":"d69b04a6.f07038","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1320,"y":780,"wires":[["431cc7fc.05acc8"]]},{"id":"1c258ca3.7db1c3","type":"change","z":"22758675.f9ceca","name":"POST _analyze","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/_analyze","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":990,"y":780,"wires":[["9bf2452f.f8a6c8"]]},{"id":"ba6d8af8.1e84e8","type":"change","z":"22758675.f9ceca","name":"POST _update/id","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_update/","tot":"str"},{"t":"change","p":"url","pt":"msg","from":"$","fromt":"re","to":"id","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1300,"y":840,"wires":[["ff013073.faa16"]]},{"id":"ff013073.faa16","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n \"doc\": {\n \"vector\": [{{{swem}}}]\n 
}\n}","output":"str","x":1480,"y":840,"wires":[["41d38f58.ce732"]]},{"id":"aecc912c.2aea7","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"i","pt":"msg","to":"$number(i+1)\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":1800,"y":840,"wires":[["affeb34f.a1dad"]]},{"id":"41d38f58.ce732","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","x":1640,"y":840,"wires":[["aecc912c.2aea7"]]},{"id":"da98d14c.f3f2d","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"i","targetType":"msg","x":790,"y":930,"wires":[]},{"id":"498109a8.b56ac8","type":"comment","z":"22758675.f9ceca","name":"文ベクトルの埋め込み","info":"","x":120,"y":670,"wires":[]}]
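Apologies for an explanation that only describes what you can see: the "w2v.getVectors" function node puts the segmented tokens into an array, passes it to the model object's getVectors(), and stores the result in msg.payload.
Sentence vectors are generated with my own interpretation of the SWEM algorithm (max-pooling).
The node is named "SWEM(max)もどき" (pseudo-SWEM) because there is no step that assigns random values to unknown words. Every word in the corpus goes into the model, so there should be no unknown words.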
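Max-pooling here means taking, for each dimension, the largest value among the word vectors of the sentence. The following standalone sketch shows that step under the assumption that each word vector is an array-like of numbers of equal length (in the flow, that would be the `values` of each getVectors() result). `swemMax` is a hypothetical name.

```javascript
// Sketch of SWEM-max: element-wise maximum over the word vectors of one sentence.
function swemMax(wordVectors) {
    if (wordVectors.length === 0) {
        return []; // no known words: the caller has to decide what to store
    }
    const dims = wordVectors[0].length;
    const pooled = new Array(dims).fill(-Infinity);
    for (const vec of wordVectors) {
        for (let d = 0; d < dims; d++) {
            if (vec[d] > pooled[d]) {
                pooled[d] = vec[d];
            }
        }
    }
    return pooled;
}

console.log(swemMax([[0.1, -0.5, 0.3], [0.4, -0.2, 0.0]])); // [ 0.4, -0.2, 0.3 ]
```

The empty-input case is worth handling explicitly: a title whose tokens are all missing from the model would otherwise produce a vector of the wrong length for the `dense_vector` field.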
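Running the flow stores the sentence vectors in the Elasticsearch documents.
If you define an index pattern in Kibana and check in Discover, you should see data like the following.
Incidentally, for some reason Kibana shows the field type as "unknown".
In Discover, the "vector" values appear as one long row of numbers...
One important point, mentioned in passing: when searching vector data in Elasticsearch, every document within the search scope must have vector data embedded, otherwise the search itself fails with an error.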
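One way to guard against documents without a vector is to check for the field inside the script before calling the vector function. The sketch below builds such a search body. It assumes the `vector` field from the mapping above; `buildCosineQuery` is a hypothetical helper, and the flows in this post do not include this guard.

```javascript
// Sketch: cosine-similarity search body that tolerates documents without a vector.
function buildCosineQuery(queryVector, size) {
    return {
        size: size,
        query: {
            script_score: {
                query: { match_all: {} },
                script: {
                    // Documents with no vector get score 0 instead of raising an error.
                    source: "doc['vector'].size() == 0 ? 0 : " +
                            "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                    params: { query_vector: queryVector }
                }
            }
        }
    };
}

// Request body for: POST /newsrss/_search
const searchBody = JSON.stringify(buildCosineQuery([0.1, 0.2, 0.3], 10));
```

The query vector must have the same number of elements as the `dims` declared in the mapping; the three-element vector above is only for illustration.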
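Searching in combination with Kibana
First, build the process that outputs the search results in Node-RED.
The Elasticsearch document ID is passed in over HTTP as the id parameter, and for that document's title the flow outputs, as HTML, the results of three searches: the usual full-text search (labeled TF-IDF here; strictly speaking, Elasticsearch's default scoring is BM25), the cosine similarity search, and TF-IDF * cosine similarity.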
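For reference, the cosine similarity used for ranking can be written out in a few lines. This is a plain JavaScript illustration of the formula, not what Elasticsearch runs internally. It also shows why the queries add 1.0: cosine similarity ranges from -1 to 1, and Elasticsearch does not accept negative scores.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|), for two vectors of equal length.
function cosineSimilarity(a, b) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0]));        // 1  (same direction)
console.log(cosineSimilarity([1, 0], [-1, 0]) + 1.0); // 0  (opposite direction, shifted)
```

With the shift, the combined score in the third table is the text score multiplied by a factor between 0 and 2.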
The exported flow JSON is below. Note: adjust the Elasticsearch IP address, authentication, and so on to your own environment.
[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":false,"url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"61cd4038.0ba17","type":"template","z":"22758675.f9ceca","name":"ES query (TF-IDF * CosSim)","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n \"query\": {\n \"function_score\": {\n \"query\": {\n \"query_string\": {\n \"query\": \"{{qtokens}}\"\n }\n },\n \"functions\": [\n {\n \"script_score\": {\n \"script\": {\n \"source\": \"cosineSimilarity(params.query_vector,'vector') + 1.0\",\n \"params\": {\n \"query_vector\": [{{{result.vector}}}]\n }\n }\n }\n }\n ],\n \"score_mode\": \"multiply\",\n \"boost_mode\": \"multiply\",\n \"min_score\": 1\n }\n }\n}","output":"str","x":1100,"y":1220,"wires":[["f7cafd31.cbdc3"]]},{"id":"5eaa2de3.fa9d24","type":"http in","z":"22758675.f9ceca","name":"","url":"/similar","method":"get","upload":false,"swaggerDoc":"","x":110,"y":1060,"wires":[["247e675e.058f38"]]},{"id":"247e675e.058f38","type":"change","z":"22758675.f9ceca","name":"GET 
newsrss/_doc/id","rules":[{"t":"set","p":"method","pt":"msg","to":"GET","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_doc/","tot":"str"},{"t":"change","p":"url","pt":"msg","from":"$","fromt":"re","to":"payload.id","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":310,"y":1060,"wires":[["4e555cd7.063aa4"]]},{"id":"4e555cd7.063aa4","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":510,"y":1060,"wires":[["22efb4bd.b72e4c"]]},{"id":"c91e69e7.83dc58","type":"change","z":"22758675.f9ceca","name":"tokens退避","rules":[{"t":"move","p":"payload.tokens","pt":"msg","to":"tokens","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1190,"y":1060,"wires":[["ad512cf1.3884d"]]},{"id":"1facca89.7431a5","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n \"analyzer\": \"kuromoji\",\n \"text\": \"{{{result.title}}}\"\n}","output":"str","x":860,"y":1060,"wires":[["632e4f6b.d5e06"]]},{"id":"632e4f6b.d5e06","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","env":[],"x":1020,"y":1060,"wires":[["c91e69e7.83dc58"]]},{"id":"22efb4bd.b72e4c","type":"change","z":"22758675.f9ceca","name":"POST _analyze","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/_analyze","tot":"str"},{"t":"move","p":"payload._source","pt":"msg","to":"result","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":690,"y":1060,"wires":[["1facca89.7431a5"]]},{"id":"ad512cf1.3884d","type":"function","z":"22758675.f9ceca","name":"Query Tokens","func":"\nmsg.qtokens = \"\";\n\nfor(var i=0;i<msg.tokens.length;i++){\n msg.qtokens += msg.tokens[i].token + \" \";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1380,"y":1060,"wires":[["67390762.bcfe88"]]},{"id":"6161be2e.9dcb2","type":"change","z":"22758675.f9ceca","name":"POST 
newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":840,"y":1220,"wires":[["61cd4038.0ba17"]]},{"id":"f7cafd31.cbdc3","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1320,"y":1220,"wires":[["8258234e.89eef"]]},{"id":"fff61a04.80b808","type":"http response","z":"22758675.f9ceca","name":"","statusCode":"200","headers":{"content-type":"text/html"},"x":1740,"y":1220,"wires":[]},{"id":"1d8f253d.c65a0b","type":"template","z":"22758675.f9ceca","name":"HTML","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"<!doctype html>\n<html lang=\"ja\">\n <head>\n <meta charset=\"utf-8\">\n </head>\n <body>\n TF-IDF(デフォルト)<br>\n <table border=\"1\">\n <tr>\n <th>スコア</th>\n <th>タイトル</th>\n </tr>\n {{{out1html}}}\n </table>\n Cos類似度+1.0<br>\n <table border=\"1\">\n <tr>\n <th>スコア</th>\n <th>タイトル</th>\n </tr>\n {{{out2html}}}\n </table>\n TF-IDF * (Cos類似度+1.0)<br>\n <table border=\"1\">\n <tr>\n <th>スコア</th>\n <th>タイトル</th>\n </tr>\n {{{out3html}}}\n </table>\n </body>\n</html>","output":"str","x":1600,"y":1220,"wires":[["fff61a04.80b808"]]},{"id":"8258234e.89eef","type":"function","z":"22758675.f9ceca","name":"TABLE","func":"\nmsg.out3html = \"\";\n\nfor(var i=0;i<msg.payload.hits.hits.length;i++){\n msg.out3html += \"<tr>\";\n msg.out3html += \"<td>\" + msg.payload.hits.hits[i]._score + \"</td>\";\n msg.out3html += \"<td>\" + msg.payload.hits.hits[i]._source.title + \"</td>\";\n msg.out3html += \"</tr>\\n\";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1470,"y":1220,"wires":[["1d8f253d.c65a0b"]]},{"id":"67390762.bcfe88","type":"change","z":"22758675.f9ceca","name":"POST 
newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":840,"y":1120,"wires":[["fdae635a.428ff"]]},{"id":"fdae635a.428ff","type":"template","z":"22758675.f9ceca","name":"ES query (TF-IDF)","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n \"query\": {\n \"query_string\": {\n \"query\": \"{{qtokens}}\"\n }\n }\n}","output":"str","x":1070,"y":1120,"wires":[["e8a572e4.7c11e"]]},{"id":"e8a572e4.7c11e","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1260,"y":1120,"wires":[["5d9bafc3.7829b"]]},{"id":"5d9bafc3.7829b","type":"function","z":"22758675.f9ceca","name":"TABLE","func":"\nmsg.out1html = \"\";\n\nfor(var i=0;i<msg.payload.hits.hits.length;i++){\n msg.out1html += \"<tr>\";\n msg.out1html += \"<td>\" + msg.payload.hits.hits[i]._score + \"</td>\";\n msg.out1html += \"<td>\" + msg.payload.hits.hits[i]._source.title + \"</td>\";\n msg.out1html += \"</tr>\\n\";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1410,"y":1120,"wires":[["80fae3a3.bc9d3"]]},{"id":"7d23737a.f6b10c","type":"template","z":"22758675.f9ceca","name":"ES query (CosSim)","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n \"query\": {\n \"function_score\": {\n \"query\": {\n \"match_all\": {}\n },\n \"functions\": [\n {\n \"script_score\": {\n \"script\": {\n \"source\": \"cosineSimilarity(params.query_vector,'vector') + 1.0\",\n \"params\": {\n \"query_vector\": [{{{result.vector}}}]\n }\n }\n }\n }\n ],\n \"score_mode\": \"multiply\",\n \"boost_mode\": \"multiply\",\n \"min_score\": 1\n }\n }\n}","output":"str","x":1070,"y":1170,"wires":[["58f2114a.ff0ca"]]},{"id":"80fae3a3.bc9d3","type":"change","z":"22758675.f9ceca","name":"POST 
newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":840,"y":1170,"wires":[["7d23737a.f6b10c"]]},{"id":"58f2114a.ff0ca","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1260,"y":1170,"wires":[["7b231146.dbfc6"]]},{"id":"7b231146.dbfc6","type":"function","z":"22758675.f9ceca","name":"TABLE","func":"\nmsg.out2html = \"\";\n\nfor(var i=0;i<msg.payload.hits.hits.length;i++){\n msg.out2html += \"<tr>\";\n msg.out2html += \"<td>\" + msg.payload.hits.hits[i]._score + \"</td>\";\n msg.out2html += \"<td>\" + msg.payload.hits.hits[i]._source.title + \"</td>\";\n msg.out2html += \"</tr>\\n\";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1410,"y":1170,"wires":[["6161be2e.9dcb2"]]},{"id":"49875731.aed0e8","type":"comment","z":"22758675.f9ceca","name":"kibana連携URL","info":"","x":110,"y":1010,"wires":[]}]
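The function nodes do nothing notable, so I will only explain what the Elasticsearch query for the cosine similarity search looks like.
That said, it is the script_score query you see everywhere. cosineSimilarity returns a value between -1 and 1, and Elasticsearch does not accept negative scores, so 1.0 is added. As for passing in the vector to search with, in Node-RED all it took was handing the array object to a template node, which made it really easy...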
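As a rough sketch of the query described above, the request body can also be assembled as a plain JavaScript object, for example inside a function node. The helper name `buildSimilarityQuery` and its arguments are assumptions made for illustration and are not part of the exported flow; the script source, the `vector` field name, and the `score_mode` / `boost_mode` / `min_score` values are copied from the template nodes in the flow.

```javascript
// Hypothetical helper (not in the exported flow): builds the same request body
// that the "ES query (CosSim)" and "ES query (TF-IDF * CosSim)" template nodes render.
function buildSimilarityQuery(queryVector, queryText) {
  if (!Array.isArray(queryVector) || queryVector.length === 0) {
    throw new Error("queryVector must be a non-empty array of numbers");
  }
  // With text: the keyword match supplies the base score.
  // Without text: match_all gives every document a base score of 1.0.
  const baseQuery = queryText
    ? { query_string: { query: queryText } }
    : { match_all: {} };
  return {
    query: {
      function_score: {
        query: baseQuery,
        functions: [
          {
            script_score: {
              script: {
                // +1.0 shifts the cosine similarity from [-1, 1] to [0, 2],
                // because Elasticsearch rejects negative scores.
                source: "cosineSimilarity(params.query_vector,'vector') + 1.0",
                params: { query_vector: queryVector }
              }
            }
          }
        ],
        score_mode: "multiply",
        boost_mode: "multiply",
        min_score: 1
      }
    }
  };
}

// Example: cosine similarity only (no keyword part).
const body = buildSimilarityQuery([0.12, -0.34, 0.56]);
console.log(JSON.stringify(body, null, 2));
```

In the match_all case the base score is 1.0, so the final score is simply the cosine similarity plus 1.0, and `min_score: 1` should drop documents whose cosine similarity is negative.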
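Next, in the Kibana index pattern, set the format of the "_id" field to URL. Configure the URL template so that it correctly calls the Node-RED endpoint you created.
With this in place, clicking an _id link in Discover and similar views displays the result of the Node-RED search flow. Node-RED makes it easy to stand up HTTP endpoints, and it pairs quite well with Kibana index patterns.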
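To make the link between the two sides concrete, here is a minimal sketch of what happens to an `_id` on its way from Kibana to Elasticsearch. It assumes the Kibana URL template has the form `http://<node-red-host>:1880/similar?id={{value}}`; the host name and port 1880 are placeholders, while the `/similar` path, the `id` query parameter, and the `newsrss/_doc/` URL come from the flow above. Both helper functions are hypothetical and only mimic the behavior.

```javascript
// Hypothetical helper: mimics how a Kibana URL template expands {{value}}
// (the URI-escaped field value) for one document.
function expandUrlTemplate(template, value) {
  return template.split("{{value}}").join(encodeURIComponent(value));
}

// Hypothetical helper: what the "GET newsrss/_doc/id" change node effectively
// does, appending the received id to the document URL.
function buildDocUrl(esBase, index, id) {
  return `${esBase}/${index}/_doc/${encodeURIComponent(id)}`;
}

// Assumed Node-RED host and port; replace with your own.
const link = expandUrlTemplate("http://nodered.example:1880/similar?id={{value}}", "abc123");
console.log(link);
console.log(buildDocUrl("http://localhost:30920", "newsrss", "abc123"));
```

In the flow, the `http in` node exposes the query string as `msg.payload.id`, which is why the change node can append it to the URL directly.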
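Below is an output that looked reasonable. Please look at the top three hits of each table; I think the characteristics of each search method show up (though I cannot explain them...).
The similarity scores come out too high, so, as expected, the small amount of data makes the similarity unreliable.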
Conclusion
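There are still many problems to solve before this is usable in practice, but it may be a helpful reference if you are uneasy about public clouds or work in a closed in-house environment.
Tokenization filters, user dictionaries, and synonym dictionaries can all be customized in Elasticsearch as well, so I think this makes a fairly compact machine learning kit for natural language processing.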
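As an illustration of the customization mentioned above, the following sketch builds index settings for a custom kuromoji analyzer with inline user dictionary rules and a synonym filter. This is not something the flow in this article uses: the names `my_kuromoji`, `my_synonyms`, and `my_ja`, the helper function, and the sample dictionary entries are all made up for the example, and it assumes an Elasticsearch version that supports `user_dictionary_rules` with the kuromoji plugin installed.

```javascript
// Hypothetical helper: returns index settings that could be sent when creating
// an index (PUT /<index>). All analyzer, tokenizer, and filter names are illustrative.
function buildAnalysisSettings(userDictionaryRules, synonyms) {
  return {
    settings: {
      analysis: {
        tokenizer: {
          my_kuromoji: {
            type: "kuromoji_tokenizer",
            mode: "search",
            // Rule format: surface,segmentation,readings,part-of-speech
            user_dictionary_rules: userDictionaryRules
          }
        },
        filter: {
          my_synonyms: { type: "synonym", synonyms: synonyms }
        },
        analyzer: {
          my_ja: {
            type: "custom",
            tokenizer: "my_kuromoji",
            filter: ["kuromoji_baseform", "kuromoji_part_of_speech", "ja_stop", "my_synonyms"]
          }
        }
      }
    }
  };
}

// Sample values only; replace with your own vocabulary.
const settings = buildAnalysisSettings(
  ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"],
  ["ラズパイ, ラズベリーパイ"]
);
console.log(JSON.stringify(settings, null, 2));
```

The `_analyze` request in the flow could then name such an analyzer instead of `kuromoji`, provided the request is sent to the index that defines it.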