Joining DataFrames
dataDF.join(dataDFother, dataDF("name") === dataDFother("name"), "inner").show()
Besides the usual inner, leftouter, rightouter, and fullouter join types, there are also leftsemi and leftanti.
They behave as follows.
scala> dataDF.show()
+------+---+----------+
|  name|age|  birthday|
+------+---+----------+
|Brooke| 20|2001-06-19|
|  Bake| 25|1996-07-25|
| Denny| 31|1990-08-16|
| Jules| 30|1991-05-25|
|    TD| 35|1986-07-26|
+------+---+----------+
scala> dataDFother.show()
+------+---+----------+
|  name|age|  birthday|
+------+---+----------+
|Brooke| 20|2001-06-19|
|  Bake| 25|1996-07-25|
| Denny| 31|1990-08-17|
|   Jon| 30|1991-07-25|
|    AK| 35|1986-07-29|
+------+---+----------+
scala> dataDF.join(dataDFother, dataDF("name") === dataDFother("name"), "leftsemi").show()
+------+---+----------+
|  name|age|  birthday|
+------+---+----------+
|Brooke| 20|2001-06-19|
|  Bake| 25|1996-07-25|
| Denny| 31|1990-08-16|
+------+---+----------+
# leftsemi keeps only the left-side rows whose key also exists on the right, and returns only the left side's columns
scala> dataDF.join(dataDFother, dataDF("name") === dataDFother("name"), "leftanti").show()
+-----+---+----------+
| name|age|  birthday|
+-----+---+----------+
|Jules| 30|1991-05-25|
|   TD| 35|1986-07-26|
+-----+---+----------+
# leftanti keeps the left-side rows whose key has no match on the right
As these results show, leftsemi and leftanti act more like filters on the left side than like joins.
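The filter-like behavior can be modeled without Spark at all. The following is a minimal sketch in plain Scala collections, not how Spark implements these joins; `Person`, `left`, `right`, and `rightKeys` are hypothetical names introduced only for this illustration, and the rows mirror the two tables above.

```scala
// Plain-Scala model of leftsemi / leftanti on the "name" key (no Spark required).
// Person is a hypothetical case class standing in for one row.
final case class Person(name: String, age: Int, birthday: String)

val left = Seq(
  Person("Brooke", 20, "2001-06-19"),
  Person("Bake", 25, "1996-07-25"),
  Person("Denny", 31, "1990-08-16"),
  Person("Jules", 30, "1991-05-25"),
  Person("TD", 35, "1986-07-26")
)
val right = Seq(
  Person("Brooke", 20, "2001-06-19"),
  Person("Bake", 25, "1996-07-25"),
  Person("Denny", 31, "1990-08-17"),
  Person("Jon", 30, "1991-07-25"),
  Person("AK", 35, "1986-07-29")
)

// The right side only contributes its set of keys; none of its columns are returned.
val rightKeys: Set[String] = right.map(_.name).toSet

// leftsemi: keep left rows whose key exists on the right
val semi = left.filter(p => rightKeys.contains(p.name))
// leftanti: keep left rows whose key does not exist on the right
val anti = left.filterNot(p => rightKeys.contains(p.name))

println(semi.map(_.name)) // List(Brooke, Bake, Denny)
println(anti.map(_.name)) // List(Jules, TD)
```

Because the right side is reduced to a set of keys, a key that appears several times on the right still yields each left row only once, and every left row lands in exactly one of the two results.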
scala> dataDF.join(dataDFother, dataDF("name") === dataDFother("name") and dataDF("birthday") === dataDFother("birthday"), "inner").show()
+------+---+----------+------+---+----------+
|  name|age|  birthday|  name|age|  birthday|
+------+---+----------+------+---+----------+
|Brooke| 20|2001-06-19|Brooke| 20|2001-06-19|
|  Bake| 25|1996-07-25|  Bake| 25|1996-07-25|
+------+---+----------+------+---+----------+
# to join on multiple keys, chain the conditions with and
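When every key is a plain equality on columns that share a name on both sides, the `join` overload that takes a `Seq` of column names is a possible alternative to chaining conditions. The sketch below assumes a local SparkSession and stores `birthday` as a string, since the original column type is not shown; the rows are copied from the tables above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: running outside spark-shell. Inside spark-shell, skip these two
// statements because `spark` and its implicits are already in scope.
val spark = SparkSession.builder().master("local[*]").appName("join-using-columns").getOrCreate()
import spark.implicits._

// Assumption: birthday is kept as a string for simplicity.
val dataDF = Seq(
  ("Brooke", 20, "2001-06-19"),
  ("Bake", 25, "1996-07-25"),
  ("Denny", 31, "1990-08-16"),
  ("Jules", 30, "1991-05-25"),
  ("TD", 35, "1986-07-26")
).toDF("name", "age", "birthday")

val dataDFother = Seq(
  ("Brooke", 20, "2001-06-19"),
  ("Bake", 25, "1996-07-25"),
  ("Denny", 31, "1990-08-17"),
  ("Jon", 30, "1991-07-25"),
  ("AK", 35, "1986-07-29")
).toDF("name", "age", "birthday")

// Equi-join on both name and birthday; each key column appears once in the result.
val joined = dataDF.join(dataDFother, Seq("name", "birthday"), "inner")
joined.show()
```

With this form the key columns are not duplicated in the output, although non-key columns that share a name (here `age`) still appear once per side.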
scala> dataDF.join(dataDFother, dataDF("name") === dataDFother("name") or dataDF("birthday") === dataDFother("birthday"), "inner").show()
+------+---+----------+------+---+----------+
|  name|age|  birthday|  name|age|  birthday|
+------+---+----------+------+---+----------+
|Brooke| 20|2001-06-19|Brooke| 20|2001-06-19|
|  Bake| 25|1996-07-25|  Bake| 25|1996-07-25|
| Denny| 31|1990-08-16| Denny| 31|1990-08-17|
+------+---+----------+------+---+----------+
# or works as well
scala> dataDF.join(dataDFother, dataDF("name") === dataDFother("name") & dataDF("birthday") === dataDFother("birthday"), "inner").show()
<console>:35: error: value & is not a member of org.apache.spark.sql.Column
dataDF.join(dataDFother, dataDF("name") === dataDFother("name") & dataDF("birthday") === dataDFother("birthday"), "inner").show()
# a single & is not defined on Column; the symbolic forms are && and ||
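The error above comes from the single `&`. `Column` does define the doubled operators `&&` and `||`, which correspond to `and` and `or`. The sketch below assumes a local SparkSession and string-typed `birthday` values (the original column type is not shown), with rows copied from the tables above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: running outside spark-shell. Inside spark-shell, skip these two
// statements because `spark` and its implicits are already in scope.
val spark = SparkSession.builder().master("local[*]").appName("join-operators").getOrCreate()
import spark.implicits._

// Assumption: birthday is kept as a string for simplicity.
val dataDF = Seq(
  ("Brooke", 20, "2001-06-19"),
  ("Bake", 25, "1996-07-25"),
  ("Denny", 31, "1990-08-16"),
  ("Jules", 30, "1991-05-25"),
  ("TD", 35, "1986-07-26")
).toDF("name", "age", "birthday")

val dataDFother = Seq(
  ("Brooke", 20, "2001-06-19"),
  ("Bake", 25, "1996-07-25"),
  ("Denny", 31, "1990-08-17"),
  ("Jon", 30, "1991-07-25"),
  ("AK", 35, "1986-07-29")
).toDF("name", "age", "birthday")

// && is the symbolic form of `and`: both conditions must hold.
val both = dataDF.join(
  dataDFother,
  dataDF("name") === dataDFother("name") && dataDF("birthday") === dataDFother("birthday"),
  "inner")

// || is the symbolic form of `or`: either condition is enough.
val either = dataDF.join(
  dataDFother,
  dataDF("name") === dataDFother("name") || dataDF("birthday") === dataDFother("birthday"),
  "inner")

both.show()   // the same two rows as the `and` example
either.show() // the same three rows as the `or` example
```

No extra parentheses are needed here because `===` binds more tightly than `&&` and `||` under Scala's operator precedence rules.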