I tried joins in Elasticsearch 6.3 expecting them to work like MySQL, and the best practice turned out to be pretty unglamorous

Posted at 2019-10-30

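Introduction

I have probably gotten a number of things wrong here, so please don't take this as authoritative.

Goal

In MySQL terms: "I want to join tables A, B, and C."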

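Constraints I found

You can't split the tables up in the first place
- Everything has to go into the same table (index), so column (field) names must not collide.
  - It helps to make the top level of the data an object keyed by table name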

# Field names collide
data_a = { "id" : ... }     # goes into the index first
data_b = { "id" : ... }     # "id" is already mapped in the index, so a type different from a's id is an error

# This way nothing collides
data = {
    "a" : { "id" : ...},
    "b" : { "id" : ...},
    "c" : { "id" : ...}
}

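Fundamentally different from a MySQL join
- mysql : with LEFT JOIN ... ON you choose which values to join on
- es : you can only join against the parent table's primary key
- mysql : LEFT JOIN stitches already-inserted data together at SELECT time
- es : for both parent and child, the join is fixed at insert time ("join o with x")
1:N joins are possible
- But you cannot turn it into N:1. The parent is absolute.
- This is because the parent-child relation has to be decided up front
You can join on a.aid == b.aid, but b.bid == c.bid does not connect c to a
- The third table, c, also needs to hold the same value as a.aid.
  - Joins only work on _id, so every table has to carry a value for which a = b = c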

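Conclusion first: the best practice

Don't use join. Having each table update its own part of the data seems to be the best approach (instead of three documents a, b, and c, use one document that holds a, b, and c together).

The details of this best practice come at the end.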

How to join (ver 6.3)

We join with this structure

Parent : myhost
Child : myuser
Child : mycost

Run these in order in Kibana Dev Tools and it should work (Elasticsearch v6.3; it fails on v7).


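# Delete the index first
DELETE mydata

# Define the parent-child relation
PUT mydata
{
  "mappings": {                <-------- this part is pretty much boilerplate (please don't throw stones)
    "_doc": {
      "properties": {
        "my_join_field": {                 <-------- this join field goes on *every* record to force them together. Painful
          "type": "join",
          "relations": {
            "myhost": ["myuser", "mycost"]       <----- the parent-child relation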
          }
        }
      }
    }
  }
}

# data "myhost"
PUT mydata/_doc/hostA?refresh            <----- sets the document's unique ID `_id` to hostA
{
  "myhost": {                          <--------- the record's payload. If myuser or mycost had keys with the same names they would clash, so everything is nested under a "myhost" object to prevent that
    "hostname": "hostA",
    "project": "projectA",
    "exist": "1"
  },
  "my_join_field": {
    "name": "myhost"             <----- declares this document to be a myhost; ties to "relations" in the mappings
  }
}

# data "myuser"
PUT mydata/_doc/2?routing=hostA&refresh      <----- registered with unique ID 2. routing=hostA makes it get indexed in the same place as the document with _id=hostA
{
  "myuser": {                          <--------- the record's payload
    "username": "sato-san",
    "email": "sato@example.com"
  },
  "my_join_field": {
    "name": "myuser",   <----- declares this document to be a myuser; ties to "relations" in the mappings
    "parent": "hostA"     <----- the _id of the join target (parent = myhost) is hostA. Roughly LEFT JOIN b ON a._id = "hostA"
  }
}

# data "mycost"
PUT mydata/_doc/3?routing=hostA&refresh    <----- registered with unique ID 3. routing=hostA makes it get indexed in the same place as the document with _id=hostA
{
  "mycost": {                          <--------- the record's payload
    "monthly": "1000000",
    "currency": "yen"
  },
  "my_join_field": {
    "name": "mycost",  <----- declares this document to be a mycost; ties to "relations" in the mappings
    "parent": "hostA"   <----- the _id of the join target (parent = myhost) is hostA. Roughly LEFT JOIN b ON a._id = "hostA"
  }
}

I haven't looked into routing yet, so I don't really understand it. What I do know is that unless you set it correctly, nothing gets joined (I got bitten by this).
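As background on why routing matters here: the Elasticsearch documentation for the `_routing` field describes shard selection as `shard_num = hash(_routing) % num_primary_shards`, where `_routing` defaults to the document's `_id`, and the join field requires parent and children to sit on the same shard. The sketch below only illustrates that formula. The function names are hypothetical, and `zlib.crc32` is a stand-in assumption for the hash Elasticsearch actually uses (Murmur3), so the shard numbers it prints will not match a real cluster.

```python
import zlib
from typing import Optional

NUM_PRIMARY_SHARDS = 5  # matches "_shards.total": 5 in the responses in this article


def pick_shard(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """shard_num = hash(_routing) % num_primary_shards (crc32 is only a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shard_for(doc_id: str, routing: Optional[str] = None) -> int:
    """Without ?routing=..., the document's own _id is used as the routing value."""
    return pick_shard(routing if routing is not None else doc_id)


parent_shard = shard_for("hostA")                     # PUT mydata/_doc/hostA
child_with_routing = shard_for("2", routing="hostA")  # PUT mydata/_doc/2?routing=hostA
child_without_routing = shard_for("2")                # routed by its own _id instead

print(parent_shard, child_with_routing, child_without_routing)
```

With `routing=hostA` the child is hashed on the same value as the parent, so it always lands on the parent's shard. Without it, the child is hashed on its own `_id` and may end up somewhere else.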

Search all the data (the disappointing version)

# show all  - the 3 documents come back side by side
GET mydata/_search
{
  "query": {
    "match_all": {}
  }
}

Result of searching everything (the disappointing version)

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,               <-----  3 documents hit
    "max_score": 1,
    "hits": [                   <----- nothing is joined; the data comes back here in separate pieces
      {
        "_index": "mydata",
        "_type": "_doc",
        "_id": "hostA",          <-----  the parent's _id
        "_score": 1,
        "_source": {
          "myhost": {                 <-----  the parent's myhost data
            "hostname": "hostA",
            "project": "projectA",
            "exist": "1"
          },
          "my_join_field": {
            "name": "myhost"
          }
        }
      },
      {
        "_index": "mydata",
        "_type": "_doc",
        "_id": "3",
        "_score": 1,
        "_routing": "hostA",
        "_source": {
          "mycost": {                  <-----  the child's mycost data
            "monthly": "1000000",
            "currency": "yen"
          },
          "my_join_field": {
            "name": "mycost",
            "parent": "hostA"
          }
        }
      },
      {
        "_index": "mydata",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_routing": "hostA",
        "_source": {
          "myuser": {                  <-----  the child's myuser data
            "username": "sato-san",
            "email": "sato@example.com"
          },
          "my_join_field": {
            "name": "myuser",
            "parent": "hostA"
          }
        }
      }
    ]
  }
}

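Select with a join (nice)

Using has_child together with inner_hits, I was able to get the data back in nested form.
With this query, a parent that has no children does not hit. I figured I could just always insert some dummy child first, and left it at that.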

# SELECT join query
GET mydata/_search
{
  "query": {
    "bool" : {
      "should" : [
        {
          "has_child": {
            "type": "myuser",
            "query": {
                "match_all": {}
            },
            "inner_hits": {}
          }
        },
        {
          "has_child": {
            "type": "mycost",
            "query": {
                "match_all": {}
            },
            "inner_hits": {}
          }
        }
      ]
    }
  },
  "sort": ["_id"]
}

Join result

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,         <-----  1 document hit. It's joined!
    "max_score": null,
    "hits": [
      {
        "_index": "mydata",
        "_type": "_doc",
        "_id": "hostA",
        "_score": null,
        "_source": {
          "myhost": {                    <--- parent data
            "hostname": "hostA",
            "project": "projectA",
            "exist": "1"
          },
          "my_join_field": {
            "name": "myhost"
          }
        },
        "sort": [
          "hostA"
        ],
        "inner_hits": {                <--- this object holds the child data
          "myuser": {         <--- block for child data 1
            "hits": {
              "total": 1,
              "max_score": 1,
              "hits": [               <------ with 1:N, any additional matching children show up in this array
                {
                  "_index": "mydata",
                  "_type": "_doc",
                  "_id": "2",
                  "_score": 1,
                  "_routing": "hostA",
                  "_source": {
                    "myuser": {                     <--- child data 1
                      "username": "sato-san",
                      "email": "sato@example.com"
                    },
                    "my_join_field": {
                      "name": "myuser",
                      "parent": "hostA"
                    }
                  }
                }
              ]
            }
          },
          "mycost": {                            <--- block for child data 2
            "hits": {
              "total": 1,
              "max_score": 1,
              "hits": [                         <--- array of the child data 2 documents that hit
                {
                  "_index": "mydata",
                  "_type": "_doc",
                  "_id": "3",
                  "_score": 1,
                  "_routing": "hostA",
                  "_source": {
                    "mycost": {                         <--- child data 2
                      "monthly": "1000000",
                      "currency": "yen"
                    },
                    "my_join_field": {
                      "name": "mycost",
                      "parent": "hostA"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

While I was looking into this, someone gave me this pearl of wisdom: "the readability of JSON is seriously terrible."
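If you do stay with join, one way to cope with the nesting is to flatten each hit on the application side. The following is a minimal sketch, assuming the response shape shown above and the `my_join_field` name from this article. `flatten_join_hit` is a hypothetical helper, not part of any Elasticsearch client.

```python
def flatten_join_hit(hit: dict, join_field: str = "my_join_field") -> dict:
    """Merge a parent hit and its inner_hits children into one plain dict."""
    merged = {}
    # The parent's _source holds its payload plus the join field; drop the join field.
    for key, value in hit["_source"].items():
        if key != join_field:
            merged[key] = value
    # Each inner_hits entry is keyed by child type and holds a list of child hits.
    for child_type, block in hit.get("inner_hits", {}).items():
        merged[child_type] = [
            child["_source"][child_type] for child in block["hits"]["hits"]
        ]
    return merged


# Trimmed copy of the join result shown above.
sample_hit = {
    "_id": "hostA",
    "_source": {
        "myhost": {"hostname": "hostA", "project": "projectA", "exist": "1"},
        "my_join_field": {"name": "myhost"},
    },
    "inner_hits": {
        "myuser": {"hits": {"hits": [{"_source": {
            "myuser": {"username": "sato-san", "email": "sato@example.com"},
            "my_join_field": {"name": "myuser", "parent": "hostA"},
        }}]}},
        "mycost": {"hits": {"hits": [{"_source": {
            "mycost": {"monthly": "1000000", "currency": "yen"},
            "my_join_field": {"name": "mycost", "parent": "hostA"},
        }}]}},
    },
}

flat = flatten_join_hit(sample_hit)
print(flat)
```

Children are kept as lists because a 1:N join can return several child documents per type.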

Best practice

Put tables a, b, and c into one document.
You have to specify the join at insert time anyway, so the app might as well send data that is already joined.

insert

No mapping is needed up front, so we just start putting data in

# Delete the index first
DELETE mybest

# Insert the "host" data
POST mybest/_doc/hostA?refresh        <------ creates the document with _id=hostA
{
  "host": {                             <------ parent data
    "hostname": "hostA",
    "project": "projectA",
    "exist": "1"
  }
}

# Add the "user" data to the document above (update)
POST mybest/_doc/hostA/_update?refresh          <------ updates the document with _id=hostA
{
  "doc" : {
    "user": {                    <------ child data
      "username": "sato-san",
      "email": "sato@example.com"
    }
  }
}

# Add the "cost" data to the document above (update)
POST mybest/_doc/hostA/_update?refresh        <------ updates the document with _id=hostA
{
  "doc" : {
    "cost": {                              <------ child data
      "monthly": "1000000",
      "currency": "yen"
    }
  }
}

Of course you can insert everything in one go from the start. I just wanted to leave an example of update.
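To make "the app sends already-joined data" concrete, here is a minimal sketch of assembling one document per host before indexing. It assumes, purely for illustration, that the user and cost rows each carry a `hostname` key pointing at their host (the data in this article has no such key), and `build_documents` is a hypothetical helper.

```python
def build_documents(hosts, users, costs, key="hostname"):
    """Group rows from three 'tables' into one document per host, namespaced by table."""
    docs = {}
    for table, rows in (("host", hosts), ("user", users), ("cost", costs)):
        for row in rows:
            doc = docs.setdefault(row[key], {})
            if table == "host":
                # The host keeps its hostname, as in the example above.
                doc[table] = dict(row)
            else:
                # Child rows drop the (hypothetical) foreign key once they are nested.
                doc[table] = {k: v for k, v in row.items() if k != key}
    return docs


hosts = [{"hostname": "hostA", "project": "projectA", "exist": "1"}]
users = [{"hostname": "hostA", "username": "sato-san", "email": "sato@example.com"}]
costs = [{"hostname": "hostA", "monthly": "1000000", "currency": "yen"}]

docs = build_documents(hosts, users, costs)
print(docs["hostA"])
```

Each key of `docs` would be used as the `_id` and each value as the body of a single index request. This sketch keeps only one user and one cost per host. A host with several users would need a list under `user` instead.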

select

Search everything, with no join at all

# show all
GET mybest/_search
{
  "query": {
    "match_all": {}
  }
}

Result

The data came back far cleaner than it did with join

{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,                 <--------- there is only one document, so naturally the hit count is 1
    "max_score": 1,
    "hits": [
      {
        "_index": "mybest",
        "_type": "_doc",
        "_id": "hostA",
        "_score": 1,
        "_source": {
          "host": {                  <--------- parent
            "hostname": "hostA",
            "project": "projectA",
            "exist": "1"
          },
          "user": {                          <--------- child 1
            "email": "sato@example.com",
            "username": "sato-san"
          },
          "cost": {                                   <--------- child 2
            "monthly": "1000000",
            "currency": "yen"
          }
        }
      }
    ]
  }
}

Bonus: API differences between versions

What MySQL calls a SELECT or INSERT statement, ES calls an API.
INSERT statement = index API
https://www.elastic.co/guide/en/elasticsearch/reference/6.3/docs-index_.html

UPDATE statement = update API
https://www.elastic.co/guide/en/elasticsearch/reference/7.4/docs-update.html

The API endpoint differs depending on the version.

# ES 6.x has _doc in the path
POST test/_doc/1/_update

# ES 7.x does not
POST test/_update/1
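If an app has to talk to both versions, the difference can be hidden behind a small helper. This is a minimal sketch covering only the two endpoint shapes shown above. `update_path` is a hypothetical function, and other major versions are deliberately rejected rather than guessed at.

```python
def update_path(index: str, doc_id: str, major_version: int) -> str:
    """Return the update API path for the two versions discussed in this article."""
    if major_version == 6:
        return f"{index}/_doc/{doc_id}/_update"  # 6.x: the _doc type is part of the path
    if major_version == 7:
        return f"{index}/_update/{doc_id}"       # 7.x: no type in the path
    raise ValueError(f"major version {major_version} is not covered by this sketch")


print("POST", update_path("test", "1", 6))
print("POST", update_path("test", "1", 7))
```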

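In the official documentation, you can switch the target version by changing the version in the URL (the 6.3 / 7.4 part of the links above). Until now I had been googling all over again every time...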
