SQL's `row_number` window function is handy, isn't it?
As of 1.4.0, you can use `row_number` with Apache Spark DataFrames too!
## A sample DataFrame
| version | name |
|---|---|
| 1.0 | Apple Pie |
| 1.1 | Banana Bread |
| 1.5 | Cupcake |
| 1.6 | Donut |
| 2.0 | Eclair |
| 2.1 | Froyo |
| 2.3 | Gingerbread |
| 3.0 | Honeycomb |
| 4.0 | Ice Cream Sandwich |
| 4.3 | Jelly Bean |
| 4.4 | KitKat |
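For reference, a DataFrame like the one above could be built in the spark-shell roughly as follows (a minimal sketch; the shell-provided `sqlContext` implicits and the column names `version` / `name` are assumptions based on the table):

```scala
// Sketch: build the sample DataFrame in the spark-shell (Spark 1.4.x era).
// Assumes the implicit sqlContext that the shell provides.
import sqlContext.implicits._

val df = Seq(
  ("1.0", "Apple Pie"),
  ("1.1", "Banana Bread"),
  ("1.5", "Cupcake"),
  ("1.6", "Donut"),
  ("2.0", "Eclair"),
  ("2.1", "Froyo"),
  ("2.3", "Gingerbread"),
  ("3.0", "Honeycomb"),
  ("4.0", "Ice Cream Sandwich"),
  ("4.3", "Jelly Bean"),
  ("4.4", "KitKat")
).toDF("version", "name")  // two string columns, eleven rows
```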
## row_number
Import `org.apache.spark.sql.expressions.Window` and pass a window spec to `rowNumber().over()`. (Note: on Spark 1.4/1.5 window functions require a `HiveContext`, and from Spark 1.6 the function was renamed `row_number()`.)
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // for the $"..." column syntax

val identified = df.select(
  // no partition columns: one global sequence over all rows
  rowNumber().over( Window.partitionBy().orderBy() ) as "id",
  $"version",
  $"name"
)
```
As above, if you pass no arguments to `partitionBy`, sequential numbers are assigned across the entire dataset.
```scala
identified.take(5)
res1: Array[org.apache.spark.sql.Row] = Array([1,"1.0","Apple Pie"], [2,"1.1","Banana Bread"], [3,"1.5","Cupcake"], [4,"1.6","Donut"], [5,"2.0","Eclair"])
```
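Conversely, if you do pass columns to `partitionBy`, the numbering restarts within each partition. A sketch of that variant (the `major` column derived here is hypothetical, just for illustration):

```scala
// Sketch: restart the numbering per partition.
// "major" is a hypothetical column holding the leading digit of version.
val perMajor = df
  .withColumn("major", $"version".substr(1, 1))
  .select(
    // each major version gets its own id sequence, ordered by version
    rowNumber().over( Window.partitionBy($"major").orderBy($"version") ) as "id",
    $"version",
    $"name"
  )
```

Note that when you partition, you generally also want an explicit `orderBy` so the numbering within each partition is deterministic.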