How to run row_number on an Apache Spark DataFrame

Posted at 2015-09-07

The SQL window function row_number is handy, isn't it?
It is also available on Apache Spark DataFrames from version 1.4.0 onward :smiley:

Sample DataFrame

version name
1.0 Apple Pie
1.1 Banana Bread
1.5 Cupcake
1.6 Donut
2.0 Eclair
2.1 Froyo
2.3 Gingerbread
3 Honeycomb
4.0 Ice Cream Sandwich
4.3 Jelly Bean
4.4 KitKat
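
For readers who want to follow along, here is a minimal sketch of how the sample above might be built. It assumes a spark-shell session on Spark 1.4 or later, where `sqlContext` is already defined. Storing `version` as a string is also an assumption, chosen so that values such as "3" stay exactly as listed.

```scala
// Assumes spark-shell, where `sqlContext` is predefined.
import sqlContext.implicits._

// `version` is kept as a string (an assumption) so "3" and "4.0" are preserved as written.
val df = Seq(
  ("1.0", "Apple Pie"),
  ("1.1", "Banana Bread"),
  ("1.5", "Cupcake"),
  ("1.6", "Donut"),
  ("2.0", "Eclair"),
  ("2.1", "Froyo"),
  ("2.3", "Gingerbread"),
  ("3",   "Honeycomb"),
  ("4.0", "Ice Cream Sandwich"),
  ("4.3", "Jelly Bean"),
  ("4.4", "KitKat")
).toDF("version", "name")
```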

row_number

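Import org.apache.spark.sql.expressions.Window and pass a window specification to rowNumber().over(). (From Spark 1.6 the function is named row_number(), and rowNumber() was removed in 2.0.)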

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

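// Number every row; partitionBy() with no arguments keeps all rows in one partition.
val identified = df.select(
    rowNumber().over( Window.partitionBy().orderBy() ) as "id",
    $"version",
    $"name"  // no trailing comma: older Scala versions reject it
)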

When partitionBy is called with no arguments, as above, sequential numbers are assigned across the entire dataset.

identified.take(5)
res1: Array[org.apache.spark.sql.Row] = Array([1,"1.0","Apple Pie"], [2,"1.1","Banana Bread"], [3,"1.5","Cupcake"], [4,"1.6","Donut"], [5,"2.0","Eclair"])
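
To contrast with the global numbering above, here is a rough sketch of numbering that restarts within each partition. It is not from the original post. It assumes Spark 2.x or later (`SparkSession`, `row_number()`), a shortened copy of the sample data, and a hypothetical `major` column derived from the part of `version` before the first dot. An explicit `orderBy` is used because newer releases generally expect an ordered window for `row_number()`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, substring_index}

object RowNumberPerMajor {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-number-sketch")
      .getOrCreate()
    import spark.implicits._

    // Shortened copy of the sample data, with version kept as a string.
    val df = Seq(
      ("1.0", "Apple Pie"), ("1.1", "Banana Bread"), ("1.5", "Cupcake"),
      ("2.0", "Eclair"), ("2.1", "Froyo"), ("4.0", "Ice Cream Sandwich")
    ).toDF("version", "name")

    // Hypothetical partition key: the text before the first "." in version.
    val withMajor = df.withColumn("major", substring_index($"version", ".", 1))

    // Numbering restarts at 1 for each distinct major value.
    val w = Window.partitionBy($"major").orderBy($"version")
    val numbered = withMajor.withColumn("id", row_number().over(w))

    numbered.orderBy($"major", $"id").show()
    spark.stop()
  }
}
```

Passing a column to partitionBy is the only structural difference from the earlier snippet: each group of rows sharing the same key gets its own sequence.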