1
2

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

Node-RED+Elasticsearchでクラウドサービスに頼らず類似文検索をする

Last updated at Posted at 2020-08-29

はじめに

Node-REDでも自然言語処理の類似検索を実装させたくて昨年末くらいに試した(冬休みの自由研究)事を書きます。
Elasticsearch 7.3あたりから Dense vector型の検索が扱えるようになりました。Elasticsearchでベクトル型が扱えるという事は、「1対1の比較」ではなく「検索できる」という事です。
といっても、Elasticsearchでベクトルの算出をするものではないので、Node-REDで頑張ってベクトルの算出をします。
まぁ、個人的に「性能」とか「精度」はどうでもいいので、ラズパイのNode-REDでクラウドサービスに頼らずに自然言語処理の機械学習のような事をしたかっただけです。

環境

  • Raspberry Pi 4B (RAM:4GB)
  • Raspbian (Buster)
  • Node-RED v1.0.3 (node v12.14.0/npm 6.13.4)
  • Elasticsearch 7.9.0 + kuromoji plugin(自宅kubernetes上に構築済)

Node-REDの設定

word2vecのモジュールをインストールして、Node-REDから利用できるように設定します。

word2vecのインストール

Node-REDを実行しているユーザの ~/.node-red ディレクトリへ移動して npmコマンドでインストールします。
普通にインストールできるのですが、実はARM用のバイナリではありません。

root@chino:~# cd .node-red/
root@chino:~/.node-red# npm -v
6.13.4
root@chino:~/.node-red# npm install word2vec

> word2vec@1.1.4 postinstall /root/.node-red/node_modules/word2vec
> make --directory=src

make: ディレクトリ '/root/.node-red/node_modules/word2vec/src' に入ります
make: 'all' に対して行うべき事はありません.
make: ディレクトリ '/root/.node-red/node_modules/word2vec/src' から出ます
npm WARN node-red-project@0.0.1 No repository field.
npm WARN node-red-project@0.0.1 No license field.
npm WARN optional SKIPPING OPTIONAL DEPENDENCY: xpc-connection@0.1.4 (node_modules/xpc-connection):
npm WARN notsup SKIPPING OPTIONAL DEPENDENCY: Unsupported platform for xpc-connection@0.1.4: wanted {"os":"darwin","arch":"any"} (current: {"os":"linux","arch":"arm"})

+ word2vec@1.1.4
added 23 packages from 5 contributors and audited 343 packages in 7.708s

6 packages are looking for funding
  run `npm fund` for details

found 7 vulnerabilities (6 low, 1 critical)
  run `npm audit fix` to fix them, or `npm audit` for details

バイナリがx86_64用のようで、インストールしただけでは使えません。
fileコマンドでx86-64のELFバイナリである事が確認できます。

root@chino:~/.node-red# cd node_modules/word2vec/src
root@chino:~/.node-red/node_modules/word2vec/src# ls -al
合計 268
drwxr-xr-x 2 root root  4096  8月  8 22:14 .
drwxr-xr-x 5 root root  4096  8月  8 22:14 ..
-rw-r--r-- 1 root root 11358 10月 26  1985 LICENSE
-rw-r--r-- 1 root root  1209 10月 26  1985 README.txt
-rwxr-xr-x 1 root root 17328 10月 26  1985 compute-accuracy
-rw-r--r-- 1 root root  5221 10月 26  1985 compute-accuracy.c
-rwxr-xr-x 1 root root   631 10月 26  1985 demo-analogy.sh
-rwxr-xr-x 1 root root   358 10月 26  1985 demo-classes.sh
-rwxr-xr-x 1 root root   885 10月 26  1985 demo-phrase-accuracy.sh
-rwxr-xr-x 1 root root   853 10月 26  1985 demo-phrases.sh
-rwxr-xr-x 1 root root  5126 10月 26  1985 demo-train-big-model-v1.sh
-rwxr-xr-x 1 root root   414 10月 26  1985 demo-word-accuracy.sh
-rwxr-xr-x 1 root root   272 10月 26  1985 demo-word.sh
-rwxr-xr-x 1 root root 21312 10月 26  1985 distance
-rw-r--r-- 1 root root  4557 10月 26  1985 distance.c
-rw-r--r-- 1 root root   741 10月 26  1985 makefile
-rwxr-xr-x 1 root root 21272 10月 26  1985 word-analogy
-rw-r--r-- 1 root root  4664 10月 26  1985 word-analogy.c
-rwxr-xr-x 1 root root 22520 10月 26  1985 word2phrase
-rw-r--r-- 1 root root  9387 10月 26  1985 word2phrase.c
-rwxr-xr-x 1 root root 52688 10月 26  1985 word2vec
-rw-r--r-- 1 root root 26195 10月 26  1985 word2vec.c
root@chino:~/.node-red/node_modules/word2vec/src# file word2vec
word2vec: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=c615749e7ef3d22ea0cd261a42e2e17551a925cf, not stripped

makefileがあるので、もしやと思ったのですが、再コンパイルすれば利用できるようになりました。

root@chino:~/.node-red/node_modules/word2vec/src# make clean
rm -rf word2vec word2phrase distance word-analogy compute-accuracy
root@chino:~/.node-red/node_modules/word2vec/src# make
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector 
word2vec.c: In function ‘TrainModelThread’:
word2vec.c:366:36: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
   unsigned long long next_random = (long long)id;
                                    ^
word2vec.c:372:50: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
   fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);
                                                  ^
word2vec.c:413:54: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
       fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);
                                                      ^
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector 
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector 
distance.c: In function ‘main’:
distance.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^~
gcc word-analogy.c -o word-analogy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector 
word-analogy.c: In function ‘main’:
word-analogy.c:31:8: warning: unused variable ‘ch’ [-Wunused-variable]
   char ch;
        ^~
gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -fno-stack-protector 
compute-accuracy.c: In function ‘main’:
compute-accuracy.c:28:109: warning: unused variable ‘ch’ [-Wunused-variable]
   char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
                                                                                                             ^~
chmod +x *.sh

コンパイルエラーが出たらソースまで修正する覚悟だったのですが、WARNINGだけでしたね。

settings.jsにword2vecを追加

Node-REDの設定ファイル「settings.js」の「functionGlobalContext」に word2vecの定義を追加して、Node-REDを再起動します。

root@chino:~/.node-red/node_modules/word2vec/src# cd -
/root/.node-red
root@chino:~/.node-red# vi settings.js
           : (略)
    functionGlobalContext: {
        os: require('os'),
        mdns: require('multicast-dns'),
        gnuplot: require('gnuplot'),
        word2vec: require('word2vec')
    },
           : (略)
root@chino:~/.node-red# systemctl restart nodered

データ初期化

Elasticsearchのインデックス作成と、settings/mappingsを定義して、コーパス用のファイルを初期化します。
スクリーンショット 2020-08-29 0.33.27.png

フローを書き出したJSONは以下です。※ElasticsearchのIPアドレスや認証などはご自身の環境に合わせてください。

[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"81ed13ed.f867b","type":"template","z":"22758675.f9ceca","name":"ES settings/mappings","field":"payload","fieldType":"msg","format":"json","syntax":"plain","template":"{\n    \"settings\": {\n        \"index\": {\n            \"analysis\": {\n                \"tokenizer\": {\n                    \"custom_tokenizer\": {\n                        \"type\": \"kuromoji_tokenizer\",\n                        \"mode\": \"search\",\n                        \"discard_punctuation\": \"true\",\n                        \"user_dictionary\": \"/opt/elasticsearch/data/userdict.txt\"\n                    }\n                },\n                \"filter\": {\n                    \"lowercase\": {\n                        \"type\": \"lowercase\",\n                        \"language\": \"greek\"\n                    },\n                    \"length\": {\n                        \"type\": \"length\",\n                        \"min\": \"2\"\n                    },\n                    \"stop\": {\n                        \"type\": \"stop\",\n                        \"stopwords\": [\"undefined\"]\n                    },\n                    \"pos\": {\n                        \"type\": \"kuromoji_part_of_speech\",\n                        \"stoptags\": [\n                                \"名詞-数\",\n                                \"その他-間投\",\n                                \"フィラー\",\n                                \"感動詞\",\n                                \"記号-一般\",\n                                \"記号-括弧開\",\n                                \"記号-括弧閉\",\n                                \"記号-句点\",\n                                \"記号-空白\",\n                                \"記号-読点\",\n                                \"記号\",\n                                \"形容詞-自立\",\n                                \"形容詞-接尾\",\n                                \"形容詞-非自立\",\n                                \"形容詞\",\n                                \"語断片\",\n                                \"助詞-格助詞-一般\",\n                                \"助詞-格助詞-引用\",\n                                \"助詞-格助詞-連語\",\n                                \"助詞-格助詞\",\n                                \"助詞-間投助詞\",\n                                \"助詞-係助詞\",\n                                \"助詞-終助詞\",\n                                \"助詞-接続助詞\",\n                                \"助詞-特殊\",\n                                \"助詞-副詞化\",\n                                \"助詞-副助詞\",\n                                \"助詞-副助詞/並立助詞/終助詞\",\n                                \"助詞-並立助詞\",\n                                \"助詞-連体化\",\n                                \"助詞\",\n                                \"助動詞\",\n                                \"接続詞\",\n                                \"接頭詞-形容詞接続\",\n                                \"接頭詞-数接続\",\n                                \"接頭詞-動詞接続\",\n                                \"接頭詞-名詞接続\",\n                                \"接頭詞\",\n                                \"動詞-自立\",\n                                \"動詞-接尾\",\n                                \"動詞-非自立\",\n                                \"動詞\",\n                                \"非言語音\",\n                                \"副詞-一般\",\n                                \"副詞-助詞類接続\",\n                                \"副詞\",\n                                \"連体詞\"\n                            ]\n                    }\n                },\n                \"analyzer\": {\n                    \"custom_analyzer\": {\n                        \"filter\": [\n                                \"kuromoji_baseform\",\n                                \"kuromoji_stemmer\",\n                                \"cjk_width\",\n                                \"ja_stop\",\n                                \"lowercase\",\n                                \"length\",\n                                \"stop\",\n                                \"pos\"\n                            ],\n                        \"type\": \"custom\",\n                        \"tokenizer\": \"custom_tokenizer\"\n                    }\n                }\n            }\n        }\n    },\n    \"mappings\": {\n        \"properties\": {\n            \"title\": {\n                \"type\": \"text\",\n                \"fields\": {\n                    \"keyword\": {\n                        \"type\": \"keyword\",\n                        \"ignore_above\": 1024\n                    },\n                    \"token\": {\n                        \"type\": \"text\",\n                        \"analyzer\": \"custom_analyzer\",\n                        \"fielddata\": true\n                    }\n                }\n            },\n            \"url\": {\n                \"type\": \"text\"\n            },\n            \"vector\": {\n                \"type\": \"dense_vector\",\n                \"dims\": 300\n            }\n        }\n    }\n}","output":"str","x":900,"y":80,"wires":[["6472220c.3ce12c"]]},{"id":"1774e1ed.32b88e","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":1270,"y":80,"wires":[]},{"id":"a51fdfbb.1e407","type":"inject","z":"22758675.f9ceca","name":"ES INDEX 初期化","repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":130,"y":80,"wires":[["3c7f6014.b2a8e"]]},{"id":"3c7f6014.b2a8e","type":"change","z":"22758675.f9ceca","name":"DELETE newsrss","rules":[{"t":"set","p":"method","pt":"msg","to":"DELETE","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":340,"y":80,"wires":[["ca8fdd88.d2faf"]]},{"id":"ca8fdd88.d2faf","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":530,"y":80,"wires":[["3176165d.d4098a"]]},{"id":"3176165d.d4098a","type":"change","z":"22758675.f9ceca","name":"PUT newsrss","rules":[{"t":"set","p":"method","pt":"msg","to":"PUT","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":700,"y":80,"wires":[["81ed13ed.f867b"]]},{"id":"6472220c.3ce12c","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1100,"y":80,"wires":[["1774e1ed.32b88e"]]},{"id":"bfc3b5a.1600348","type":"comment","z":"22758675.f9ceca","name":"Elasticsearch インデックス初期化","info":"","x":160,"y":30,"wires":[]},{"id":"101ae4a6.4d2eeb","type":"comment","z":"22758675.f9ceca","name":"コーパス削除","info":"","x":90,"y":140,"wires":[]},{"id":"dfdc80ac.73867","type":"file","z":"22758675.f9ceca","name":"","filename":"/tmp/corpus.txt","appendNewline":true,"createDir":true,"overwriteFile":"delete","encoding":"none","x":340,"y":190,"wires":[["2d4006eb.3e066a"]]},{"id":"c8cd3c4e.d1d4e","type":"inject","z":"22758675.f9ceca","name":"コーパス削除","repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":120,"y":190,"wires":[["dfdc80ac.73867"]]},{"id":"2d4006eb.3e066a","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","x":540,"y":190,"wires":[]}]

カスタムアナライザを定義していますが、Kibanaのタグクラウド用なので定義しなくても大丈夫ですが....入れておいてもいいんじゃないでしょうか。

Node-REDでのベクトルデータ生成

ほとんど興味本位での独学なので、うまく説明できているかはわかりません。
単語のベクトルは大量のサンプルデータ(コーパス)から単語を抜き出し、パースして文単位でword2vecに分析させます。
単語の使い方(並び方)のようなものが欲しいので、感覚的には「何かを説明している文」が最適では無いかと思います。
ここでは実際の検索対象の文をコーパスにして単語ベクトルのモデルを作成してみます。(短い文で件数が少ないので、あまり精度は良く無いと思います)

まずは、コーパスを取得してElasticsearchに投入します。
Yahoo!のRSSからニュース記事の「タイトル」だけ100件拾いました。
スクリーンショット 2020-08-29 0.22.05.png

フローを書き出したJSONは以下です。※ElasticsearchのIPアドレスや認証などはご自身の環境に合わせてください。
※2020/09/25:URLと括弧除去の正規表現を修正

[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":"ignore","url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"2e78b7dc.363348","type":"xml","z":"22758675.f9ceca","name":"","property":"payload","attr":"","chr":"","x":490,"y":390,"wires":[["28f083d1.832a6c"]]},{"id":"1917fb84.d25874","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/cnetj/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":310,"wires":[["2e78b7dc.363348"]]},{"id":"6a5ce588.e2abac","type":"inject","z":"22758675.f9ceca","name":"RSS1","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":310,"wires":[["1917fb84.d25874"]]},{"id":"28f083d1.832a6c","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"move","p":"payload.rss.channel.0.item","pt":"msg","to":"item","tot":"msg"},{"t":"set","p":"i","pt":"msg","to":"0","tot":"num"}],"action":"","property":"","from":"","to":"","reg":false,"x":650,"y":390,"wires":[["9efe39a0.332728"]]},{"id":"9efe39a0.332728","type":"switch","z":"22758675.f9ceca","name":"","property":"i","propertyType":"msg","rules":[{"t":"lt","v":"item.length","vt":"msg"},{"t":"else"}],"checkall":"false","repair":false,"outputs":2,"x":860,"y":450,"wires":[["961ec72b.6eeb88"],["78446511.c78bbc"]]},{"id":"78446511.c78bbc","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"i","targetType":"msg","x":1040,"y":470,"wires":[]},{"id":"e23b58d.d99f4a8","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/yonnana/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":350,"wires":[["2e78b7dc.363348"]]},{"id":"42cca356.9bee7c","type":"inject","z":"22758675.f9ceca","name":"RSS2","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":350,"wires":[["e23b58d.d99f4a8"]]},{"id":"51d0e9e3.c0e6f8","type":"inject","z":"22758675.f9ceca","name":"RSS3","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":390,"wires":[["b9ae1355.aaf2e"]]},{"id":"b9ae1355.aaf2e","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/cnn/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":390,"wires":[["2e78b7dc.363348"]]},{"id":"a0032e91.5b3d3","type":"inject","z":"22758675.f9ceca","name":"RSS4","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":430,"wires":[["2af9ea67.7ebcc6"]]},{"id":"2af9ea67.7ebcc6","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/impress/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":430,"wires":[["2e78b7dc.363348"]]},{"id":"6cc867cc.f90428","type":"inject","z":"22758675.f9ceca","name":"RSS5","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"str","x":100,"y":470,"wires":[["c23ba14.15f666"]]},{"id":"c23ba14.15f666","type":"http request","z":"22758675.f9ceca","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"https://news.yahoo.co.jp/rss/media/zdn_n/all.xml","tls":"","persist":false,"proxy":"","authType":"","x":280,"y":470,"wires":[["2e78b7dc.363348"]]},{"id":"a62fa79.c6bd858","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n  \"analyzer\": \"kuromoji\",\n  \"text\": \"{{{text}}}\"\n}","output":"str","x":1410,"y":300,"wires":[["8f8b5c3d.d0fd8"]]},{"id":"8f8b5c3d.d0fd8","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1570,"y":300,"wires":[["42f47f8e.f841b"]]},{"id":"42f47f8e.f841b","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"move","p":"payload.tokens","pt":"msg","to":"tokens","tot":"msg"},{"t":"set","p":"j","pt":"msg","to":"0","tot":"num"},{"t":"set","p":"corpus","pt":"msg","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":1160,"y":350,"wires":[["84129293.579c8"]]},{"id":"84129293.579c8","type":"switch","z":"22758675.f9ceca","name":"","property":"j","propertyType":"msg","rules":[{"t":"lt","v":"tokens.length","vt":"msg"},{"t":"else"}],"checkall":"false","repair":false,"outputs":2,"x":1360,"y":390,"wires":[["9663168f.7f7c98"],["208b0a0c.704196"]]},{"id":"55fb0326.e51fcc","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"j","pt":"msg","to":"$number(j+1)\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":1710,"y":350,"wires":[["84129293.579c8"]]},{"id":"9663168f.7f7c98","type":"function","z":"22758675.f9ceca","name":"corpus += token","func":"\nmsg.corpus += msg.tokens[msg.j].token + ' ';\n\nreturn msg;","outputs":1,"noerr":0,"x":1540,"y":350,"wires":[["55fb0326.e51fcc"]]},{"id":"c3dca400.c5c3d8","type":"file","z":"22758675.f9ceca","name":"","filename":"/tmp/corpus.txt","appendNewline":true,"createDir":true,"overwriteFile":"false","encoding":"none","x":1730,"y":400,"wires":[["759510d7.c5ef5"]]},{"id":"208b0a0c.704196","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"payload","pt":"msg","to":"corpus","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1540,"y":400,"wires":[["c3dca400.c5c3d8"]]},{"id":"751de0a5.241f8","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"i","pt":"msg","to":"$number(i+1)\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":2390,"y":400,"wires":[["9efe39a0.332728"]]},{"id":"961ec72b.6eeb88","type":"function","z":"22758675.f9ceca","name":"title,link退避","func":"\nmsg.text = msg.item[msg.i].title[0];\nmsg.link = msg.item[msg.i].link[0];\n\nreturn msg;","outputs":1,"noerr":0,"x":1060,"y":300,"wires":[["4b745250.3e1d5c"]]},{"id":"4b745250.3e1d5c","type":"change","z":"22758675.f9ceca","name":"POST _analyze","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/_analyze","tot":"str"},{"t":"change","p":"text","pt":"msg","from":"(.*?)","fromt":"re","to":"","tot":"str"},{"t":"change","p":"text","pt":"msg","from":"\\(.*?\\)","fromt":"re","to":"","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":1240,"y":300,"wires":[["a62fa79.c6bd858"]]},{"id":"c5e36daa.90d1b","type":"comment","z":"22758675.f9ceca","name":"コーパス作成 & Elasticsearch データ投入","info":"","x":180,"y":260,"wires":[]},{"id":"445e8ca6.8d4674","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","x":2230,"y":400,"wires":[["751de0a5.241f8"]]},{"id":"5a565150.dff06","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n    \"date\": \"{{{date}}}\",\n    \"title\": \"{{{text}}}\",\n    \"url\": \"{{{link}}}\"\n}","output":"str","x":2070,"y":400,"wires":[["445e8ca6.8d4674"]]},{"id":"759510d7.c5ef5","type":"change","z":"22758675.f9ceca","name":"POST _doc","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_doc","tot":"str"},{"t":"set","p":"date","pt":"msg","to":"$now()\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":1910,"y":400,"wires":[["5a565150.dff06"]]}]

分かち書きは、Elasticsearchの_analyzeで"kuromoji"アナライザを使ってます。このせいでかなりここの処理が重いです。
ですが、検索時と同じ辞書を使いたい場合はこのやり方が合理的です。(今回はデフォルトで使ってますが、settings定義時に分かち書き用のカスタムアナライザを定義して、ユーザ辞書や同義語辞書を可視化と同じ定義にしておけば幸せになれると思います。)

コーパスから単語のモデルを生成します。
スクリーンショット 2020-08-29 0.58.01.png

フローを書き出したJSONは以下です。

[{"id":"50103ea1.5c42a","type":"function","z":"22758675.f9ceca","name":"w2v.word2vec","func":"\n// settings.jsのfunctionGlobalContextで「word2vec: require('word2vec')」と定義しておく。\nvar w2v = new global.get('word2vec');\n\n// binary: 1にしないとloadModelできない。size(次元数),window,iterはお好みで。\nw2v.word2vec('/tmp/corpus.txt','/tmp/model.bin',\n{size: 300,window: 3,min_count: 1,iter: 100,binary: 1},\nfunction(){\n    msg.payload = true;\n    node.send(msg);\n});\n\nreturn msg;","outputs":1,"noerr":0,"x":380,"y":600,"wires":[["67b9a2fe.9de0ac"]]},{"id":"95a550bc.fe089","type":"inject","z":"22758675.f9ceca","name":"単語ベクトルモデルの作成","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":160,"y":600,"wires":[["50103ea1.5c42a"]]},{"id":"c07ee446.494448","type":"comment","z":"22758675.f9ceca","name":"単語ベクトルモデル作成","info":"","x":130,"y":550,"wires":[]},{"id":"67b9a2fe.9de0ac","type":"switch","z":"22758675.f9ceca","name":"","property":"payload","propertyType":"msg","rules":[{"t":"true"}],"checkall":"false","repair":false,"outputs":1,"x":540,"y":600,"wires":[["c145fa09.be7608"]]},{"id":"c145fa09.be7608","type":"function","z":"22758675.f9ceca","name":"w2v.loadModel","func":"\nvar w2v = new global.get('word2vec');\n\nw2v.loadModel('/tmp/model.bin',\nfunction(error,model){\n    msg.model = model;\n    node.send(msg);\n});\n\nreturn msg;","outputs":1,"noerr":0,"x":700,"y":600,"wires":[["ef6550ca.840be"]]},{"id":"38bf8a61.931626","type":"switch","z":"22758675.f9ceca","name":"","property":"payload","propertyType":"msg","rules":[{"t":"true"}],"checkall":"false","repair":false,"outputs":1,"x":1050,"y":600,"wires":[["aec1dac0.14ab78"]]},{"id":"aec1dac0.14ab78","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"model","targetType":"msg","x":1200,"y":600,"wires":[]},{"id":"ef6550ca.840be","type":"function","z":"22758675.f9ceca","name":"msg.model確認","func":"\nif(typeof msg.model !== 'undefined'){\n    msg.payload = true;\n} else {\n    msg.payload = null;\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":890,"y":600,"wires":[["38bf8a61.931626"]]}]

最初のfunctionノードですが、コーパスからモデルを生成しています。word2vecのオプションは別途好みでチューニングしてください。
今回は短い文でデータ量も少ない前提でチューニング(min_count:1)しています。
コールバック関数でmsg.payloadにtrueを設定してコーパス生成処理の終了で後続のノードに繋がるようにしています。
スクリーンショット 2020-08-29 1.00.07.png

2つ目のfunctionノードでモデルを読み込んでいます。
コールバック関数でmsg.modelにモデルオブジェクトをコピーして後続のノードに繋げます。
ここらで感の良い方は気付くと思いますが、Node-REDはあまり大きなモデルは読み込めない(はず)です。つまり、ボキャブラリーを絞って学習しないとモデルのロードでメモリ不足になります。
スクリーンショット 2020-08-29 1.07.13.png

実行してモデルが作成されると、デバッグウィンドウに単語数と次元数が表示されます。
スクリーンショット 2020-08-29 1.32.40.png

モデルの生成と読み込みが確認できたら、文のベクトルを求めてElasticsearchのデータを更新します。
スクリーンショット 2020-08-29 1.40.07.png

フローを書き出したJSONは以下です。※ElasticsearchのIPアドレスや認証などはご自身の環境に合わせてください。

[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":false,"url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"3972c355.e855cc","type":"inject","z":"22758675.f9ceca","name":"文ベクトル埋め込み","topic":"","payload":"文ベクトル埋め込み","payloadType":"str","repeat":"","crontab":"","once":false,"onceDelay":0.1,"x":140,"y":720,"wires":[["2ff2a0fe.054bf"]]},{"id":"2ff2a0fe.054bf","type":"function","z":"22758675.f9ceca","name":"w2v.loadModel","func":"\nvar w2v = new global.get('word2vec');\n\nw2v.loadModel('/tmp/model.bin',\nfunction(error,model){\n    msg.model = model;\n    node.send(msg);\n});\n\nreturn msg;","outputs":1,"noerr":0,"x":340,"y":720,"wires":[["e4a708de.e21c68"]]},{"id":"e4a708de.e21c68","type":"function","z":"22758675.f9ceca","name":"msg.model確認","func":"\nif(typeof msg.model !== 'undefined'){\n    msg.payload = true;\n} else {\n    msg.payload = null;\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":530,"y":720,"wires":[["ce28b330.4c4b6"]]},{"id":"ce28b330.4c4b6","type":"switch","z":"22758675.f9ceca","name":"","property":"payload","propertyType":"msg","rules":[{"t":"true"}],"checkall":"false","repair":false,"outputs":1,"x":690,"y":720,"wires":[["bdc08afc.c94948"]]},{"id":"998a3a4b.5ba7e8","type":"function","z":"22758675.f9ceca","name":"w2v.getVectors","func":"\nmsg.array = [];\n\nif(typeof msg.tokens !== 'undefined'){\n    for(var i=0;i<msg.tokens.length;i++){\n        msg.array.push(msg.tokens[i].token);\n    }\n}\n\nmsg.payload = msg.model.getVectors(msg.array);\n\nmsg.swem = [];\n\nreturn msg;","outputs":1,"noerr":0,"x":890,"y":840,"wires":[["47c71198.d290a"]]},{"id":"47c71198.d290a","type":"function","z":"22758675.f9ceca","name":"SWEM(max)もどき","func":"\nvar keys;\nvar d;\n\nfor(var i=0;i< msg.payload.length;i++){\n    keys = Object.keys(msg.payload[i].values);\n\n    for(d=0;d<keys.length;d++){ //d=次元数\n        if(i === 0){\n            msg.swem[d] = msg.payload[i].values[keys[d]];\n        } else {\n            msg.swem[d] = (msg.payload[i].values[keys[d]] > msg.swem[d])? msg.payload[i].values[keys[d]]:msg.swem[d];\n        }\n    }\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1090,"y":840,"wires":[["ba6d8af8.1e84e8"]]},{"id":"f1b3342.5bfdec8","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n    \"query\": {\n        \"match_all\": {}\n    },\n    \"sort\": {\n        \"date\": {\n            \"order\": \"desc\"\n        }\n    },\n    \"size\": 10000,\n    \"_source\": [\"title\"]\n}","output":"str","x":1080,"y":720,"wires":[["25c663d4.cf5fec"]]},{"id":"bdc08afc.c94948","type":"change","z":"22758675.f9ceca","name":"POST newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":880,"y":720,"wires":[["f1b3342.5bfdec8"]]},{"id":"25c663d4.cf5fec","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1240,"y":720,"wires":[["ae3babb0.5ca078"]]},{"id":"431cc7fc.05acc8","type":"change","z":"22758675.f9ceca","name":"tokens退避","rules":[{"t":"move","p":"payload.tokens","pt":"msg","to":"tokens","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1490,"y":780,"wires":[["998a3a4b.5ba7e8"]]},{"id":"ae3babb0.5ca078","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"move","p":"payload.hits.hits","pt":"msg","to":"hits","tot":"msg"},{"t":"set","p":"i","pt":"msg","to":"0","tot":"num"}],"action":"","property":"","from":"","to":"","reg":false,"x":380,"y":780,"wires":[["affeb34f.a1dad"]]},{"id":"7007a3b.cbd015c","type":"function","z":"22758675.f9ceca","name":"id,title退避","func":"\nmsg.id = msg.hits[msg.i]._id;\nmsg.title = msg.hits[msg.i]._source.title;\n\nreturn msg;","outputs":1,"noerr":0,"x":810,"y":780,"wires":[["1c258ca3.7db1c3"]]},{"id":"affeb34f.a1dad","type":"switch","z":"22758675.f9ceca","name":"","property":"i","propertyType":"msg","rules":[{"t":"lt","v":"hits.length","vt":"msg"},{"t":"else"}],"checkall":"false","repair":false,"outputs":2,"x":600,"y":900,"wires":[["7007a3b.cbd015c"],["da98d14c.f3f2d"]]},{"id":"9bf2452f.f8a6c8","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n  \"analyzer\": \"kuromoji\",\n  \"text\": \"{{{title}}}\"\n}","output":"str","x":1160,"y":780,"wires":[["d69b04a6.f07038"]]},{"id":"d69b04a6.f07038","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1320,"y":780,"wires":[["431cc7fc.05acc8"]]},{"id":"1c258ca3.7db1c3","type":"change","z":"22758675.f9ceca","name":"POST _analyze","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/_analyze","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":990,"y":780,"wires":[["9bf2452f.f8a6c8"]]},{"id":"ba6d8af8.1e84e8","type":"change","z":"22758675.f9ceca","name":"POST _update/id","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_update/","tot":"str"},{"t":"change","p":"url","pt":"msg","from":"$","fromt":"re","to":"id","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1300,"y":840,"wires":[["ff013073.faa16"]]},{"id":"ff013073.faa16","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n    \"doc\": {\n        \"vector\": [{{{swem}}}]\n    }\n}","output":"str","x":1480,"y":840,"wires":[["41d38f58.ce732"]]},{"id":"aecc912c.2aea7","type":"change","z":"22758675.f9ceca","name":"","rules":[{"t":"set","p":"i","pt":"msg","to":"$number(i+1)\t","tot":"jsonata"}],"action":"","property":"","from":"","to":"","reg":false,"x":1800,"y":840,"wires":[["affeb34f.a1dad"]]},{"id":"41d38f58.ce732","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","x":1640,"y":840,"wires":[["aecc912c.2aea7"]]},{"id":"da98d14c.f3f2d","type":"debug","z":"22758675.f9ceca","name":"","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"i","targetType":"msg","x":790,"y":930,"wires":[]},{"id":"498109a8.b56ac8","type":"comment","z":"22758675.f9ceca","name":"文ベクトルの埋め込み","info":"","x":120,"y":670,"wires":[]}]

見たままの説明で申し訳ないですが、「w2v.getVectors」のfunctionノードで分かち書きしたトークンを配列に入れて、モデルオブジェクトのgetVectors()に渡した結果をmsg.payloadに入れています。
スクリーンショット 2020-08-29 1.56.26.png

文ベクトルの生成は私なりに解釈したSWEMアルゴリズム(max-pooling)です。
「もどき」と書いているのは未知語にランダム値を入れる処理が無いからです。コーパスの単語を漏れなくモデルにしているので、未知語は無いはずです。
スクリーンショット 2020-08-29 2.01.58.png

実行すると、Elasitcsearchのデータに文ベクトルが入ります。
Kibanaでインデックスパターンを定義してDiscoverで確認すると、以下のようなデータが確認できるはずです。
ちなみに、kibanaからは何故か型が「unknown」となってしまいます。
スクリーンショット 2020-08-29 2.08.19.png

Discoverではこんな感じで「vector」がずらっと...
スクリーンショット 2020-08-29 2.19.50.png

ちなみに、さらっと重要な事を書きますが、Elasticsearchでベクトルデータを検索する場合は、検索範囲の対象データ全てにベクトルデータが埋め込まれていないと検索自体がエラーになります。

Kibanaと連携して検索する

まずは、検索結果を出力する処理をNode-REDで作成します。
httpでidにElasticsearchのドキュメントIDを渡してもらい、そのドキュメントの「タイトル」について通常のTF-IDF検索とCos類似度検索およびTF-IDF*Cos類似度の結果をHTMLで出力させます。
スクリーンショット 2020-08-29 19.46.17.png

フローを書き出したJSONは以下です。※ElasticsearchのIPアドレスや認証などはご自身の環境に合わせてください。

[{"id":"3f7774f5.a6a20c","type":"subflow","name":"Elasticsearch","info":"","category":"","in":[{"x":50,"y":30,"wires":[{"id":"c5bb15c.4c10de8"}]}],"out":[{"x":580,"y":30,"wires":[{"id":"19d2a3b.d048a5c","port":0}]}],"env":[],"color":"#E6E0F8","icon":"font-awesome/fa-database"},{"id":"eeda88c1.f8b278","type":"http request","z":"3f7774f5.a6a20c","name":"Elasticsearch","method":"use","ret":"txt","paytoqs":false,"url":"","tls":"","persist":false,"proxy":"","authType":"basic","x":340,"y":30,"wires":[["19d2a3b.d048a5c"]]},{"id":"c5bb15c.4c10de8","type":"change","z":"3f7774f5.a6a20c","name":"msg.headers","rules":[{"t":"delete","p":"headers","pt":"msg"},{"t":"set","p":"headers","pt":"msg","to":"{\"Content-Type\":\"application/json\",\"Connection\":\"close\"}","tot":"json"}],"action":"","property":"","from":"","to":"","reg":false,"x":170,"y":30,"wires":[["eeda88c1.f8b278"]]},{"id":"19d2a3b.d048a5c","type":"json","z":"3f7774f5.a6a20c","name":"","property":"payload","action":"","pretty":false,"x":490,"y":30,"wires":[[]]},{"id":"61cd4038.0ba17","type":"template","z":"22758675.f9ceca","name":"ES query (TF-IDF * CosSim)","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n    \"query\": {\n        \"function_score\": {\n            \"query\": {\n                \"query_string\": {\n                    \"query\": \"{{qtokens}}\"\n                }\n            },\n            \"functions\": [\n                {\n                    \"script_score\": {\n                        \"script\": {\n                            \"source\": \"cosineSimilarity(params.query_vector,'vector') + 1.0\",\n                            \"params\": {\n                                \"query_vector\": [{{{result.vector}}}]\n                            }\n                        }\n                    }\n                }\n            ],\n            \"score_mode\": \"multiply\",\n            \"boost_mode\": \"multiply\",\n            \"min_score\": 1\n        }\n    }\n}","output":"str","x":1100,"y":1220,"wires":[["f7cafd31.cbdc3"]]},{"id":"5eaa2de3.fa9d24","type":"http in","z":"22758675.f9ceca","name":"","url":"/similar","method":"get","upload":false,"swaggerDoc":"","x":110,"y":1060,"wires":[["247e675e.058f38"]]},{"id":"247e675e.058f38","type":"change","z":"22758675.f9ceca","name":"GET newsrss/_doc/id","rules":[{"t":"set","p":"method","pt":"msg","to":"GET","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_doc/","tot":"str"},{"t":"change","p":"url","pt":"msg","from":"$","fromt":"re","to":"payload.id","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":310,"y":1060,"wires":[["4e555cd7.063aa4"]]},{"id":"4e555cd7.063aa4","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":510,"y":1060,"wires":[["22efb4bd.b72e4c"]]},{"id":"c91e69e7.83dc58","type":"change","z":"22758675.f9ceca","name":"tokens退避","rules":[{"t":"move","p":"payload.tokens","pt":"msg","to":"tokens","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":1190,"y":1060,"wires":[["ad512cf1.3884d"]]},{"id":"1facca89.7431a5","type":"template","z":"22758675.f9ceca","name":"ES query","field":"payload","fieldType":"msg","format":"json","syntax":"mustache","template":"{\n  \"analyzer\": \"kuromoji\",\n  \"text\": \"{{{result.title}}}\"\n}","output":"str","x":860,"y":1060,"wires":[["632e4f6b.d5e06"]]},{"id":"632e4f6b.d5e06","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","env":[],"x":1020,"y":1060,"wires":[["c91e69e7.83dc58"]]},{"id":"22efb4bd.b72e4c","type":"change","z":"22758675.f9ceca","name":"POST _analyze","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/_analyze","tot":"str"},{"t":"move","p":"payload._source","pt":"msg","to":"result","tot":"msg"}],"action":"","property":"","from":"","to":"","reg":false,"x":690,"y":1060,"wires":[["1facca89.7431a5"]]},{"id":"ad512cf1.3884d","type":"function","z":"22758675.f9ceca","name":"Query Tokens","func":"\nmsg.qtokens = \"\";\n\nfor(var i=0;i<msg.tokens.length;i++){\n    msg.qtokens += msg.tokens[i].token + \" \";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1380,"y":1060,"wires":[["67390762.bcfe88"]]},{"id":"6161be2e.9dcb2","type":"change","z":"22758675.f9ceca","name":"POST newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":840,"y":1220,"wires":[["61cd4038.0ba17"]]},{"id":"f7cafd31.cbdc3","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1320,"y":1220,"wires":[["8258234e.89eef"]]},{"id":"fff61a04.80b808","type":"http response","z":"22758675.f9ceca","name":"","statusCode":"200","headers":{"content-type":"text/html"},"x":1740,"y":1220,"wires":[]},{"id":"1d8f253d.c65a0b","type":"template","z":"22758675.f9ceca","name":"HTML","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"<!doctype html>\n<html lang=\"ja\">\n    <head>\n        <meta charset=\"utf-8\">\n    </head>\n    <body>\n        TF-IDF(デフォルト)<br>\n        <table border=\"1\">\n            <tr>\n                <th>スコア</th>\n                <th>タイトル</th>\n            </tr>\n            {{{out1html}}}\n        </table>\n        Cos類似度+1.0<br>\n        <table border=\"1\">\n            <tr>\n                <th>スコア</th>\n                <th>タイトル</th>\n            </tr>\n            {{{out2html}}}\n        </table>\n        TF-IDF * (Cos類似度+1.0)<br>\n        <table border=\"1\">\n            <tr>\n                <th>スコア</th>\n                <th>タイトル</th>\n            </tr>\n            {{{out3html}}}\n        </table>\n    </body>\n</html>","output":"str","x":1600,"y":1220,"wires":[["fff61a04.80b808"]]},{"id":"8258234e.89eef","type":"function","z":"22758675.f9ceca","name":"TABLE","func":"\nmsg.out3html = \"\";\n\nfor(var i=0;i<msg.payload.hits.hits.length;i++){\n    msg.out3html += \"<tr>\";\n    msg.out3html += \"<td>\" + msg.payload.hits.hits[i]._score + \"</td>\";\n    msg.out3html += \"<td>\" + msg.payload.hits.hits[i]._source.title + \"</td>\";\n    msg.out3html += \"</tr>\\n\";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1470,"y":1220,"wires":[["1d8f253d.c65a0b"]]},{"id":"67390762.bcfe88","type":"change","z":"22758675.f9ceca","name":"POST newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":840,"y":1120,"wires":[["fdae635a.428ff"]]},{"id":"fdae635a.428ff","type":"template","z":"22758675.f9ceca","name":"ES query (TF-IDF)","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n  \"query\": {\n    \"query_string\": {\n      \"query\": \"{{qtokens}}\"\n    }\n  }\n}","output":"str","x":1070,"y":1120,"wires":[["e8a572e4.7c11e"]]},{"id":"e8a572e4.7c11e","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1260,"y":1120,"wires":[["5d9bafc3.7829b"]]},{"id":"5d9bafc3.7829b","type":"function","z":"22758675.f9ceca","name":"TABLE","func":"\nmsg.out1html = \"\";\n\nfor(var i=0;i<msg.payload.hits.hits.length;i++){\n    msg.out1html += \"<tr>\";\n    msg.out1html += \"<td>\" + msg.payload.hits.hits[i]._score + \"</td>\";\n    msg.out1html += \"<td>\" + msg.payload.hits.hits[i]._source.title + \"</td>\";\n    msg.out1html += \"</tr>\\n\";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1410,"y":1120,"wires":[["80fae3a3.bc9d3"]]},{"id":"7d23737a.f6b10c","type":"template","z":"22758675.f9ceca","name":"ES query (CosSim)","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n    \"query\": {\n        \"function_score\": {\n            \"query\": {\n                \"match_all\": {}\n            },\n            \"functions\": [\n                {\n                    \"script_score\": {\n                        \"script\": {\n                            \"source\": \"cosineSimilarity(params.query_vector,'vector') + 1.0\",\n                            \"params\": {\n                                \"query_vector\": [{{{result.vector}}}]\n                            }\n                        }\n                    }\n                }\n            ],\n            \"score_mode\": \"multiply\",\n            \"boost_mode\": \"multiply\",\n            \"min_score\": 1\n        }\n    }\n}","output":"str","x":1070,"y":1170,"wires":[["58f2114a.ff0ca"]]},{"id":"80fae3a3.bc9d3","type":"change","z":"22758675.f9ceca","name":"POST newsrss/_search","rules":[{"t":"set","p":"method","pt":"msg","to":"POST","tot":"str"},{"t":"set","p":"url","pt":"msg","to":"http://localhost:30920/newsrss/_search","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":840,"y":1170,"wires":[["7d23737a.f6b10c"]]},{"id":"58f2114a.ff0ca","type":"subflow:3f7774f5.a6a20c","z":"22758675.f9ceca","name":"","x":1260,"y":1170,"wires":[["7b231146.dbfc6"]]},{"id":"7b231146.dbfc6","type":"function","z":"22758675.f9ceca","name":"TABLE","func":"\nmsg.out2html = \"\";\n\nfor(var i=0;i<msg.payload.hits.hits.length;i++){\n    msg.out2html += \"<tr>\";\n    msg.out2html += \"<td>\" + msg.payload.hits.hits[i]._score + \"</td>\";\n    msg.out2html += \"<td>\" + msg.payload.hits.hits[i]._source.title + \"</td>\";\n    msg.out2html += \"</tr>\\n\";\n}\n\nreturn msg;","outputs":1,"noerr":0,"x":1410,"y":1170,"wires":[["6161be2e.9dcb2"]]},{"id":"49875731.aed0e8","type":"comment","z":"22758675.f9ceca","name":"kibana連携URL","info":"","x":110,"y":1010,"wires":[]}]

functionノードは大した事をしていないので、Cos類似度検索のElasticsearchのクエリがどうなっているかだけ説明しておきます。
といっても、よく見かけるscript_scoreとなり、負の値が扱えないので+1.0をしています。検索するベクトル値の代入ですが、Node-REDだとtemplateノードで配列のオブジェクトを渡すだけだったのでめちゃくちゃ楽でした...
スクリーンショット 2020-08-29 19.59.20.png

次はKibanaのインデックスパターンで「_id」をURLにします。作成したNode-REDのURLがうまく呼び出せるよう設定してください。
スクリーンショット 2020-08-29 19.49.23.png

これで、Discoverなどから_idのリンクをクリックすると、Node-REDの検索処理結果が出せるようになります。お手軽にhttpのエンドポイントを作れるNode-REDとKibanaのインデックスパターンの連携もかなり相性良いですね。
スクリーンショット 2020-08-29 20.08.02.png

良さそうな出力結果は以下です。上位の3件を見て欲しいのですが、各々の検索の特徴が出ていると思います。(説明できませんが...)
類似度が高すぎるので、やはりデータ量の少なさから類似度の精度が悪いですね。

スクリーンショット 2020-08-29 20.10.00.png

おわりに

実際に使い物になるレベルにするにはまだまだ問題が多いですが、パブリッククラウドに不安を感じてたり、社内の閉じた環境などでは参考になるのではないでしょうか。
分かち書き時のフィルタや、ユーザ辞書・類似語辞書もElasticsearchでカスタマイズできるので、かなりコンパクトにまとまった自然言語処理の機械学習セットではないかと思います。

1
2
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
2

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?