0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

scala sparkの勉強備忘録② カラム周り

Last updated at Posted at 2021-08-10

引き続き勉強中

schema

val fileSchema = StructType(Array(StructField("col1", IntegerType, true),
 StructField("col2", IntegerType, true),
 StructField("col3", IntegerType, true),
 StructField("col4", StringType, true),
 StructField("col5", IntegerType, true),
 StructField("col6", Integer, true)))

val df = spark.read.schema(fileschema).option("header", "true").csv(path/filename.csv)
# カラム数の多い時はspark側に推測させるのではなくこんな感じで先にschemaを定義したほうが良いらしい
# ただ以下のようなエラーが出てできなかった、原因調査中
<console>:31: error: class java.lang.Integer is not a value

下のでもいけるみたい

val schema = "col1 int, col2 int, col3 string, col4 int, col5 int"
val df = spark.read.schema(schema).option("header",true).csv("path/filename.csv")

カラム名の変更、カラム追加、drop

# rename
scala> dataDF.withColumnRenamed("name", "namae").show()
+------+---+----------+
| namae|age|  birthday|
+------+---+----------+
|Brooke| 20|2001-06-19|
|Brooke| 25|1996-07-25|
| Denny| 31|1990-08-16|
| Jules| 30|1991-05-25|
|    TD| 35|1986-07-26|
+------+---+----------+

# カラム追加と文字列を日付に
scala> dataDF.withColumn("birthdate", to_timestamp(col("birthday"), "yyyy-MM-dd")).withColumn("age5year", expr("age+5")).show()
+------+---+----------+-------------------+--------+
|  name|age|  birthday|          birthdate|age5year|
+------+---+----------+-------------------+--------+
|Brooke| 20|2001-06-19|2001-06-19 00:00:00|      25|
|Brooke| 25|1996-07-25|1996-07-25 00:00:00|      30|
| Denny| 31|1990-08-16|1990-08-16 00:00:00|      36|
| Jules| 30|1991-05-25|1991-05-25 00:00:00|      35|
|    TD| 35|1986-07-26|1986-07-26 00:00:00|      40|
+------+---+----------+-------------------+--------+
#withColumnをつけた数だけカラム追加できる
# to_timestampは一つ目の引数で指定した文字列のどこがどれかを二つ目の引数で指定する
scala> dataDF.withColumn("birthdate", to_timestamp(col("birthday"), "yyyy-mm-ss")).withColumn("age5year", expr("age+5")).show()
+------+---+----------+-------------------+--------+
|  name|age|  birthday|          birthdate|age5year|
+------+---+----------+-------------------+--------+
|Brooke| 20|2001-06-19|2001-01-01 00:06:19|      25|
|Brooke| 25|1996-07-25|1996-01-01 00:07:25|      30|
| Denny| 31|1990-08-16|1990-01-01 00:08:16|      36|
| Jules| 30|1991-05-25|1991-01-01 00:05:25|      35|
|    TD| 35|1986-07-26|1986-01-01 00:07:26|      40|
+------+---+----------+-------------------+--------+

# drop
scala> dataDF.drop("age").show()
+------+----------+
|  name|  birthday|
+------+----------+
|Brooke|2001-06-19|
|Brooke|1996-07-25|
| Denny|1990-08-16|
| Jules|1991-05-25|
|    TD|1986-07-26|
+------+----------+

カラム追加する時に気をつけること

同じ名前でdataframe作ってカラム追加しようとすると怒られるのでカラム追加する度に違う名前にしないとダメそう

scala> val dataDF = dataDF.withColumn("birthdate", to_timestamp(col("birthday"), "yyyy-MM-dd")).withColumn("age5year", expr("age+5"))
<console>:32: error: recursive value dataDF needs type

# これならいける
scala> val dataDF = dataDF.withColumn("birthdate", to_timestamp(col("birthday"), "yyyy-MM-dd")).withColumn("age5year", expr("age+5"))
<console>:32: error: recursive value dataDF needs type

日付型の一部だけ持ってくる

scala> dataDFdate.select(year(col("birthdate"))).show()
+---------------+
|year(birthdate)|
+---------------+
|           2001|
|           1996|
|           1990|
|           1991|
|           1986|
+---------------+

scala> dataDFdate.select(month(col("birthdate"))).show()
+----------------+
|month(birthdate)|
+----------------+
|               6|
|               7|
|               8|
|               5|
|               7|
+----------------+

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?