SQL's window function row_number is handy. In Apache Spark you can use row_number on a DataFrame too, as of version 1.4.0.
Sample DataFrame
version | name
---|---
1.0 | Apple Pie
1.1 | Banana Bread
1.5 | Cupcake
1.6 | Donut
2.0 | Eclair
2.1 | Froyo
2.3 | Gingerbread
3 | Honeycomb
4.0 | Ice Cream Sandwich
4.3 | Jelly Bean
4.4 | KitKat
row_number
Import org.apache.spark.sql.expressions.Window, build a window spec with it, and pass the spec to rowNumber().over().
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // needed for the $"..." column syntax outside spark-shell

val identified = df.select(
  // rowNumber needs an ordered window; here we order by version
  rowNumber().over(Window.partitionBy().orderBy($"version")) as "id",
  $"version",
  $"name"
)
```
As shown above, when you pass no arguments to partitionBy, sequential numbers are assigned across the entire dataset.
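For contrast, here is a sketch of what happens when partitionBy does get a column: the counter restarts at 1 within each partition. The `major` column is hypothetical, derived here from `version` just for illustration (it is not in the original sample data).

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Hypothetical variation: derive a "major" column (first character of version)
// and number rows within each major version. Because the window is partitioned
// by "major", the id restarts at 1 for every partition.
val perMajor = df
  .withColumn("major", $"version".substr(1, 1))
  .select(
    rowNumber().over(Window.partitionBy($"major").orderBy($"version")) as "id",
    $"major",
    $"version",
    $"name"
  )
```

With the sample data, Apple Pie and Banana Bread would both fall in partition "1" and get ids 1 and 2, while Eclair would start over at id 1 in partition "2".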
```
identified.take(5)
res1: Array[org.apache.spark.sql.Row] = Array([1,"1.0","Apple Pie"], [2,"1.1","Banana Bread"], [3,"1.5","Cupcake"], [4,"1.6","Donut"], [5,"2.0","Eclair"])
```