SQL's `row_number` window function is handy, isn't it?
As of 1.4.0, you can use `row_number` with Apache Spark DataFrames too!
## A sample DataFrame
| version | name |
|---|---|
| 1.0 | Apple Pie |
| 1.1 | Banana Bread |
| 1.5 | Cupcake |
| 1.6 | Donut |
| 2.0 | Eclair |
| 2.1 | Froyo |
| 2.3 | Gingerbread |
| 3.0 | Honeycomb |
| 4.0 | Ice Cream Sandwich |
| 4.3 | Jelly Bean |
| 4.4 | KitKat |
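For reference, a DataFrame like the one above could be built in the spark-shell roughly as follows (a minimal sketch; the shell-provided `sqlContext` implicits and the column names `version` / `name` are assumptions based on the table):

```scala
// Sketch: build the sample DataFrame in the spark-shell (Spark 1.4.x era).
// Assumes the implicit sqlContext that the shell provides.
import sqlContext.implicits._

val df = Seq(
  ("1.0", "Apple Pie"),
  ("1.1", "Banana Bread"),
  ("1.5", "Cupcake"),
  ("1.6", "Donut"),
  ("2.0", "Eclair"),
  ("2.1", "Froyo"),
  ("2.3", "Gingerbread"),
  ("3.0", "Honeycomb"),
  ("4.0", "Ice Cream Sandwich"),
  ("4.3", "Jelly Bean"),
  ("4.4", "KitKat")
).toDF("version", "name")  // two string columns, eleven rows
```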
## row_number
Import `org.apache.spark.sql.expressions.Window` and pass a window spec to `rowNumber().over()`. (Note: on Spark 1.4/1.5 window functions require a `HiveContext`, and from Spark 1.6 the function was renamed `row_number()`.)
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._  // for the $"..." column syntax

val identified = df.select(
  // no partition columns: one global sequence over all rows
  rowNumber().over( Window.partitionBy().orderBy() ) as "id",
  $"version",
  $"name"
)
```
As above, if you pass no arguments to `partitionBy`, sequential numbers are assigned across the entire dataset.
```scala
identified.take(5)
res1: Array[org.apache.spark.sql.Row] = Array([1,"1.0","Apple Pie"], [2,"1.1","Banana Bread"], [3,"1.5","Cupcake"], [4,"1.6","Donut"], [5,"2.0","Eclair"])
```
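Conversely, if you do pass columns to `partitionBy`, the numbering restarts within each partition. A sketch of that variant (the `major` column derived here is hypothetical, just for illustration):

```scala
// Sketch: restart the numbering per partition.
// "major" is a hypothetical column holding the leading digit of version.
val perMajor = df
  .withColumn("major", $"version".substr(1, 1))
  .select(
    // each major version gets its own id sequence, ordered by version
    rowNumber().over( Window.partitionBy($"major").orderBy($"version") ) as "id",
    $"version",
    $"name"
  )
```

Note that when you partition, you generally also want an explicit `orderBy` so the numbering within each partition is deterministic.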