
Reshaping lists into data frames with rlist

R
JSON

Posted at 2018-08-22, last updated at 2018-08-23

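Deeply nested lists

Reading JSON from the wild into R without much thought and ending up with a complicated, heavily nested list is a familiar story. Working with lists in R is generally a gloomy chore, but the rlist package makes it about as easy as handling a data frame with dplyr. For the basic features of rlist I recommend the official tutorial; because it is long, there is also a somewhat shorter summary.

As practice in handling wild JSON, this article uses nyweather, a dataset bundled with the rlist package, and reshapes its various fields into data frames.

nyweather contains hourly weather data for New York retrieved through the OpenWeatherMap API. There is one list per day, each of which contains a list of hourly records, each of which in turn contains the individual measurements, so the structure is deeply nested.

First, let's check the top-level names.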

library(rlist)
head(names(nyweather))

[1] "D20130101" "D20130102" "D20130103" "D20130104" "D20130105" "D20130106"

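The top-level names are the dates on which the data was retrieved. Data ending up in the names like this is fairly common. In nyweather the timestamp is also recorded separately inside each record, so there is no need to extract it from the names, but how to extract data from names is covered later.

Let's look at the data for the first day.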

str(nyweather$D20130101, max.level = 2, list.len = 6)

List of 6
 $ message : chr ""
 $ cod     : chr "200"
 $ city_id : int 5128581
 $ calctime: num 0.597
 $ cnt     : int 25
 $ list    :List of 25
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  .. [list output truncated]

Let's look at an hourly record as well.

str(nyweather$D20130101$list[[1]])

List of 5
 $ weather:List of 1
  ..$ :List of 4
  .. ..$ id         : int 803
  .. ..$ main       : chr "Clouds"
  .. ..$ description: chr "broken clouds"
  .. ..$ icon       : chr "04d"
 $ main   :List of 5
  ..$ temp    : num 274
  ..$ pressure: int 1013
  ..$ humidity: int 50
  ..$ temp_min: num 272
  ..$ temp_max: num 276
 $ wind   :List of 3
  ..$ speed: num 10.8
  ..$ deg  : int 310
  ..$ gust : num 14.4
 $ clouds :List of 1
  ..$ all: int 75
 $ dt     : int 1356969600

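Some items are nested even further. On top of that, the structure differs slightly from day to day: some days include data that other days lack, and some data is missing. When data is missing, the field itself is omitted rather than set to null, which is a common pattern in JSON.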

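Extracting a specific item

Let's start by extracting the average temperature. If you don't think too far ahead, this is fairly easy: to reach a nested item, nest the element-access functions such as list.map() and list.mapv().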

library(rlist)
library(pipeR) # pipeR is designed to work with rlist without causing problems
nyweather %>>% 
  list.map( # operate on each date
    list %>>% # operate on each element of $list
      list.mapv(main$temp) # extract main$temp as a vector
  ) %>>%
  head(3)

$D20130101
 [1] 273.84 274.61 274.99 274.92 274.72 273.85 272.83 271.97 272.03 271.46
[11] 271.19 271.07 270.83 270.64 269.98 269.76 270.17 271.33 271.75 271.34
[21] 271.41 271.60 272.14 273.12 274.37

$D20130102
 [1] 274.37 275.37 276.43 276.61 276.27 275.56 275.50 275.83 275.76 275.83
[11] 276.03 276.03 276.09 276.60 276.80 276.66 276.80 276.73 276.67 276.73
[21] 276.65 276.60 276.80 277.17 277.17

$D20130103
 [1] 277.17 276.78 276.26 276.33 276.40 276.07 275.17 274.32 273.48 272.95
[11] 272.11 271.14 270.42 269.48 268.00 268.00 268.00 268.00 268.00 269.39
[21] 270.33 271.29

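However, extraction like this is rarely enough on its own. For one thing, we don't know the time of each observation.

So next, let's extract the time together with the average temperature.

To extract several items at once, use list.select(). You can think of it as the counterpart of dplyr's select().

As with list.map() earlier, these functions extract values by evaluating an expression in the context of each list element. This means that by changing the expression you can aggregate or transform values at the same time as extracting them.

In nyweather the time is recorded as unix time, so let's convert it to the POSIXct class while extracting it. Unix time can be converted to a date-time either with the lubridate package, as lubridate::as_datetime(x, tz = "UTC"), or with the built-in as.POSIXct(), as as.POSIXct(x, origin = "1970/1/1", tz = "UTC").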
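As a quick sanity check of the base-R conversion, the sketch below converts the first dt value shown in the str() output above. It uses only as.POSIXct(); the variable names are illustrative, and the origin is written as "1970-01-01" on the assumption that it is equivalent to the "1970/1/1" form in the text.

```r
# Unix time (seconds since 1970-01-01 UTC) taken from the str() output above
unix_time <- 1356969600

# Convert to POSIXct in UTC using base R only
dt_utc <- as.POSIXct(unix_time, origin = "1970-01-01", tz = "UTC")

# Format explicitly so the printed value does not depend on the local time zone
dt_label <- format(dt_utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(dt_label)
```

The value corresponds to 16:00 UTC on 2012-12-31, which is why the first record of D20130101 carries a date in the previous year.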

nyweather %>>%
  list.map(
    list %>>%
      list.select(
        dt = lubridate::as_datetime(dt, tz = "UTC"),
        temp = main$temp
      )
  ) %>>% head(1) %>>% str(list.len = 3)

List of 1
 $ D20130101:List of 25
  ..$ :List of 2
  .. ..$ dt  : POSIXct[1:1], format: "2012-12-31 16:00:00"
  .. ..$ temp: num 274
  ..$ :List of 2
  .. ..$ dt  : POSIXct[1:1], format: "2012-12-31 17:00:00"
  .. ..$ temp: num 275
  ..$ :List of 2
  .. ..$ dt  : POSIXct[1:1], format: "2012-12-31 18:00:00"
  .. ..$ temp: num 275
  .. [list output truncated]

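The result of list.select() is a list whose elements all have the same number of fields. Such a list can be converted to a data frame with list.stack(). Because a data frame is itself essentially a list, the results of list.stack() can be stacked again.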

nyweather %>>%
  list.map(
    list %>>%
      list.select(
        dt = lubridate::as_datetime(dt, tz = "UTC"),
        temp = main$temp
      ) %>>%
      list.stack()
  ) %>>%
  list.stack() %>>% head

dt	temp
2012-12-31 16:00:00	273.84
2012-12-31 17:00:00	274.61
2012-12-31 18:00:00	274.99
2012-12-31 19:00:00	274.92
2012-12-31 20:00:00	274.72
2012-12-31 21:00:00	273.85

The data has been extracted as a data frame.
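To see the select-then-stack step in isolation, here is a minimal sketch on a hand-written list. The hourly object and its two records are made up for illustration and only mimic the shape of nyweather$D20130101$list.

```r
library(rlist)

# Two hand-written hourly records shaped like nyweather$D20130101$list
hourly <- list(
  list(dt = 1356969600L, main = list(temp = 273.84)),
  list(dt = 1356973200L, main = list(temp = 274.61))
)

# Pick one top-level field and one nested field from every record
selected <- list.select(hourly, dt = dt, temp = main$temp)

# Every element now has the same two fields, so the list stacks into rows
temps <- list.stack(selected)
str(temps)
```

Each record becomes one row, and the names given in list.select() become the column names.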

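When values are missing

As explained at the start, the observation items in nyweather are not uniform: depending on the day and hour, an item may or may not be present.

Let's tally which items the hourly records under $list contain. list.table() is handy for this, and names() extracts the item names.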

nyweather %>>%
  list.table(
    list %>>% list.map(names(.))
  )


       clouds            dt forecast_rain          main          rain 
         1263          1263            64          1263            30 
      weather          wind 
         1263          1263
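The same kind of tally can be tried on a small list where the answer is easy to verify by eye. This is only a sketch: the hours object is invented for illustration, and it has a single level, so no inner list.map() is needed.

```r
library(rlist)

# Toy hourly records: only the second one has a `rain` field
hours <- list(
  list(dt = 1L, main = list(temp = 273.8)),
  list(dt = 2L, main = list(temp = 274.6), rain = list(`3h` = 0.8)),
  list(dt = 3L, main = list(temp = 275.0))
)

# Tally the field names across all records
field_counts <- list.table(hours, names(.))
print(field_counts)
```

A field whose count is lower than the number of records, like rain here, is one that is omitted somewhere.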

Only some of the records contain forecast_rain or rain. Let's see what kind of data forecast_rain holds.

nyweather %>>%
  list.map(
    list %>>%
      list.mapv(.$forecast_rain)
  ) %>>%
  list.clean()

$D20130111
3h 3h 3h 3h 3h 3h 3h 
 0  0  0  0  0  0  0 

$D20130113
 3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h 
0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.0 0.0 0.0 0.0 

$D20130115
  3h   3h   3h   3h   3h   3h   3h   3h   3h 
0.07 0.07 0.07 0.07 0.07 0.07 0.04 0.04 0.04 

$D20130116
  3h   3h   3h   3h   3h   3h   3h 
0.04 0.04 1.10 1.10 1.10 1.10 1.10 

$D20130117
 3h  3h  3h  3h  3h  3h  3h  3h  3h  3h 
4.5 4.5 4.5 4.5 4.5 4.5 1.2 1.2 1.2 1.2 

$D20130118
 3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h  3h 
1.2 1.2 1.2 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4

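The point to note here is that the field has to be accessed as .$forecast_rain rather than forecast_rain. Because forecast_rain is not present in every list element, accessing it by bare name stops with an error such as object 'forecast_rain' not found as soon as an element without it is reached.

Access through the $ operator, as in .$forecast_rain, returns NULL when the element does not exist, so the extraction can run over every list element. The NULLs remain in the result, but list.clean() removes them if they are not wanted.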
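The difference between the two access styles can be reproduced on a toy list. In this sketch the obs object is made up, and the bare-name call is wrapped in try() so that the expected failure does not stop the script; it assumes no variable called rain exists in the calling environment.

```r
library(rlist)

# Toy records: the second one has no `rain` field
obs <- list(
  list(dt = 1L, rain = 0.5),
  list(dt = 2L)
)

# Access through `.` yields NULL where the field is absent
via_dot <- list.map(obs, .$rain)

# Dropping the NULLs leaves only the records that had the field
present <- list.clean(via_dot)

# Bare-name access is expected to fail on the record without `rain`
bare_failed <- inherits(try(list.map(obs, rain), silent = TRUE), "try-error")
```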

Now let's extract this value together with the time, as before.

nyweather %>>%
  list.map(
    list %>>%
      list.select(
        dt = lubridate::as_datetime(dt, tz = "UTC"),
        forecast_rain = .$forecast_rain$`3h`
      )  %>>%
      list.stack()
  ) %>>%
  list.stack()

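Error in data.table::rbindlist(.data, ...): attempt to set an attribute on NULL

That failed. list.stack() is a wrapper around data.table::rbindlist(), and data.table::rbindlist() cannot handle a list whose items contain NULL.

Let's remove the NULLs with list.clean(). The NULLs to remove here sit inside the results of list.select(), so recursive = TRUE is required. Removing them leaves some rows with fewer columns, and the gaps this creates can be filled by passing fill = TRUE to list.stack().

In this example the data is sparse, so the inner step can produce one-column data frames; that is why list.stack(fill = TRUE) is run again at the outer level.

The gaps are filled with NA, so unwanted rows are easy to drop with na.omit().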

nyweather %>>%
  list.map(
    list %>>%
      list.select(
        dt = lubridate::as_datetime(dt, tz = "UTC"),
        forecast_rain = .$forecast_rain$`3h`
      )  %>>%
      list.clean(recursive = TRUE) %>>%
      list.stack(fill = TRUE)
  ) %>>%
  list.stack(fill = TRUE) %>>%
  na.omit() %>>%  head

	dt	forecast_rain
227	2013-01-10 18:00:00	0
228	2013-01-10 22:00:00	0
229	2013-01-10 23:00:00	0
230	2013-01-11 00:00:00	0
231	2013-01-11 01:00:00	0
232	2013-01-11 03:00:00	0
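The clean-then-fill pattern can also be followed step by step on a toy list. The rows object below is invented for illustration; it imitates what list.select() returns when a field is absent from some records.

```r
library(rlist)

# Records as list.select() might return them: `rain` is NULL where it was absent
rows <- list(
  list(dt = 1L, rain = 0.8),
  list(dt = 2L, rain = NULL),
  list(dt = 3L, rain = 0.4)
)

# Remove the NULL fields inside each record
cleaned <- list.clean(rows, recursive = TRUE)

# Records now differ in length; fill = TRUE pads the missing column with NA
stacked <- list.stack(cleaned, fill = TRUE)

# Keep only the rows that actually have a value
complete <- na.omit(stacked)
```

Here stacked keeps all three rows with NA in the second one, and complete keeps only the two rows that carry a rain value.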

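Using keys as values

Sometimes values such as times or dates are used as keys. When such JSON is read into R, the keys become the names attribute. To use them as values, then, it is enough to retrieve the names attribute.

In nyweather the first-level names are the retrieval dates, so let's build a data frame that includes them.

There are several ways to get the names. First, the built-in names() function works.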

names(nyweather) %>>% head

[1] "D20130101" "D20130102" "D20130103" "D20130104" "D20130105" "D20130106"

Alternatively, the name is available through .name, a variable that can be used inside rlist functions (more precisely, inside their lambda expressions).

nyweather %>>%
  list.map(.name) %>>%
  head

$D20130101
[1] "D20130101"

$D20130102
[1] "D20130102"

$D20130103
[1] "D20130103"

$D20130104
[1] "D20130104"

$D20130105
[1] "D20130105"

$D20130106
[1] "D20130106"

Note the difference in behavior between names(.) and .name.

nyweather %>>%
  list.map(names(.)) %>>%
  head

$D20130101
[1] "message"  "cod"      "city_id"  "calctime" "cnt"      "list"    

$D20130102
[1] "message"  "cod"      "city_id"  "calctime" "cnt"      "list"    

$D20130103
[1] "message"  "cod"      "city_id"  "calctime" "cnt"      "list"    

$D20130104
[1] "message"  "cod"      "city_id"  "calctime" "cnt"      "list"    

$D20130105
[1] "message"  "cod"      "city_id"  "calctime" "cnt"      "list"    

$D20130106
[1] "message"  "cod"      "city_id"  "calctime" "cnt"      "list"

.name returns the name of the element currently being processed, whereas names(.) returns the names of the items contained in that element. In other words, .name looks upward and names(.) looks downward.
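A compact way to compare the two is to run both on a two-element list. This is a sketch with a made-up days object that mirrors the top level of nyweather.

```r
library(rlist)

# Toy top level: names are dates, each element has its own named fields
days <- list(
  D20130101 = list(cod = "200", cnt = 25L),
  D20130102 = list(cod = "200", cnt = 25L)
)

# .name: the name each element has in its parent list
own_names <- list.mapv(days, .name)

# names(.): the names of the fields inside each element
child_names <- list.map(days, names(.))
```

own_names holds the two date strings, while every entry of child_names holds "cod" and "cnt".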

Combining data from different levels

The names holding the dates and the elements holding the time and temperature sit at different levels.

Data like this cannot be stacked as is.

nyweather %>>%
  list.select(
    date = .name,
    dt = list %>>% list.map(.$dt),
    temp = list %>>% list.map(.$main$temp)
  ) %>>%
  list.stack()

Error in data.table::rbindlist(.data, ...): Column 2 of item 1 is length 25, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table

The cause appears to be that data.table::rbindlist() cannot combine lists whose columns have different lengths. There are several ways around this. One is to use data.frame() instead of list.select(), building data frames first and combining them afterwards. The columns of data.frame() need to be vectors, so use list.mapv().

nyweather %>>%
  list.map(
    data.frame(
      date = .name,
      dt = list %>>% list.mapv(lubridate::as_datetime(dt, tz = "UTC")),
      temp = list %>>% list.mapv(main$temp)
    )
  ) %>>% 
  list.stack() %>>% head

date	dt	temp
D20130101	1356969600	273.84
D20130101	1356973200	274.61
D20130101	1356976800	274.99
D20130101	1356980400	274.92
D20130101	1356984000	274.72
D20130101	1356987600	273.85

As the result shows, this approach does not preserve the POSIXct class: dt comes back as plain numbers (I am not sure why).
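One plausible explanation, which is an assumption about list.mapv() internals and not something the article confirms, is that the per-element results are flattened into an atomic vector, much as base R's unlist() does, and that flattening drops the class attribute. The sketch below shows that effect with base R only, and one way to restore the class afterwards.

```r
# Two POSIXct values held in a list, as a per-element mapping would produce them
first_two <- list(
  as.POSIXct(1356969600, origin = "1970-01-01", tz = "UTC"),
  as.POSIXct(1356973200, origin = "1970-01-01", tz = "UTC")
)

# Flattening the list keeps the numbers but drops the POSIXct class
flattened <- unlist(first_two)

# The class can be restored from the plain seconds afterwards
restored <- as.POSIXct(flattened, origin = "1970-01-01", tz = "UTC")
restored_label <- format(restored, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

If this is indeed the cause, extracting dt as plain numbers and converting the whole column after stacking would keep the data.frame() approach usable.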

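Another approach is to use tibble::tibble(). In this case the result of list.map() can be included as is. The result stays in the tibble as a list column, though, so to get the equivalent of an ordinary data frame the lists have to be opened up with tidyr::unnest().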

nyweather %>>% 
  list.map(
    tibble::tibble(
      date = .name,
      dt = list %>>% list.map(lubridate::as_datetime(dt, tz = "UTC")),
      temp = list %>>% list.map(main$temp)
    )
  ) %>>% 
  list.stack() %>>% 
  tidyr::unnest() %>>% head

date	dt	temp
D20130101	2012-12-31 16:00:00	273.84
D20130101	2012-12-31 17:00:00	274.61
D20130101	2012-12-31 18:00:00	274.99
D20130101	2012-12-31 19:00:00	274.92
D20130101	2012-12-31 20:00:00	274.72
D20130101	2012-12-31 21:00:00	273.85

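Converting a complexly nested list to a data frame

The tibble approach can also turn complex structures into data frames.

As an example, consider getting each item in $main as two columns, item name and value. That is, the goal is to build a data frame like the following.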

item	value
temp	273.84
temp_min	272.15
temp_max	275.93
…	…

First, create a tibble that contains lists as elements, as follows.

nyweather %>>% 
  list.map(
    tibble::tibble(
      date = .name,
      item = list %>>% list.map(names(main)),
      value = list %>>% list.map(main)
    )
  ) %>>% 
  list.stack() %>>% head

# A tibble: 6 x 3
  date      item      value     
* <chr>     <list>    <list>    
1 D20130101 <chr [5]> <list [5]>
2 D20130101 <chr [5]> <list [5]>
3 D20130101 <chr [5]> <list [5]>
4 D20130101 <chr [5]> <list [5]>
5 D20130101 <chr [5]> <list [5]>
6 D20130101 <chr [5]> <list [5]>

Opening this up with tidyr::unnest() gives a data frame. In this example the contents of value are themselves nested lists, so tidyr::unnest() has to be run twice.

nyweather %>>% 
  list.map(
    tibble::tibble(
      date = .name,
      item = list %>>% list.map(names(main)),
      value = list %>>% list.map(main)
    )
  ) %>>%
  list.stack() %>>% 
  tidyr::unnest() %>>% 
  tidyr::unnest() %>>% head

# A tibble: 6 x 3
  date      item     value
* <chr>     <chr>    <dbl>
1 D20130101 temp      274.
2 D20130101 pressure 1013 
3 D20130101 humidity   50 
4 D20130101 temp_min  272.
5 D20130101 temp_max  276.
6 D20130101 temp      275.
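The two-pass unnest can be traced on a one-row tibble built by hand. This sketch assumes tidyr 1.0.0 or later, where the columns to unnest are named through the cols argument; the nested object and its values are illustrative only.

```r
library(tibble)
library(tidyr)

# One row whose `item` and `value` cells each hold two elements
nested <- tibble(
  date  = "D20130101",
  item  = list(c("temp", "pressure")),
  value = list(list(273.84, 1013L))
)

# First pass: expand `item` and `value` in parallel (both have two elements)
once <- unnest(nested, cols = c(item, value))

# Second pass: `value` is still a list column of length-one vectors
twice <- unnest(once, cols = value)
```

After the first pass value is still a list column; only the second pass turns it into an ordinary numeric column.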

The lists in the columns targeted by tidyr::unnest() must each have the same number of elements. In other words, a tibble like the following cannot be unnested as is.

nyweather %>>% 
  list.map(
    tibble::tibble(
      date = .name,
      item = list %>>% list.map(names(main)),
      value = list %>>% list.map(main),
      wind = list %>>% list.map(wind$speed)
    )
  ) %>>%
  list.stack() %>>% head

# A tibble: 6 x 4
  date      item      value      wind     
* <chr>     <list>    <list>     <list>   
1 D20130101 <chr [5]> <list [5]> <dbl [1]>
2 D20130101 <chr [5]> <list [5]> <dbl [1]>
3 D20130101 <chr [5]> <list [5]> <dbl [1]>
4 D20130101 <chr [5]> <list [5]> <dbl [1]>
5 D20130101 <chr [5]> <list [5]> <dbl [1]>
6 D20130101 <chr [5]> <list [5]> <dbl [1]>

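In this case, either change the part that retrieves wind$speed to list.mapv so that it is obtained as a vector, or pass .preserve = wind when unnesting to exclude it from the unnest. With the .preserve = approach, getting the order wrong can lead to unexpected duplication of data, so take care. (.preserve belongs to the tidyr version used here; it was deprecated in tidyr 1.0.0.)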

nyweather %>>% 
  list.map(
    tibble::tibble(
      date = .name,
      item = list %>>% list.map(names(main)),
      value = list %>>% list.map(main),
      wind = list %>>% list.map(wind$speed)
    )
  ) %>>%
  list.stack() %>>% 
  tidyr::unnest(.preserve = wind) %>>% head

# A tibble: 6 x 4
  date      wind      item     value    
* <chr>     <list>    <chr>    <list>   
1 D20130101 <dbl [1]> temp     <dbl [1]>
2 D20130101 <dbl [1]> pressure <int [1]>
3 D20130101 <dbl [1]> humidity <int [1]>
4 D20130101 <dbl [1]> temp_min <dbl [1]>
5 D20130101 <dbl [1]> temp_max <dbl [1]>
6 D20130101 <dbl [1]> temp     <dbl [1]>

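Incidentally, an ordinary data frame can also hold lists as elements if you wrap them in I(), and these can be opened with tidyr::unnest().

However, in the RStudio preview each item is displayed as something like <S3: AsIs>, which makes the number of elements in each list hard to see. When working with complex data that needs several rounds of unnest, it is better to use tibble to avoid mistakes when expanding.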

nyweather %>>% 
  list.map(
    data.frame(
      date = .name,
      item = I(list %>>% list.map(names(main))),
      value = I(list %>>% list.map(main))
    )
  ) %>>%
  list.stack() %>>% head

# A tibble: 6 x 3
  date      item      value     
* <fct>     <I(list)> <I(list)> 
1 D20130101 <chr [5]> <list [5]>
2 D20130101 <chr [5]> <list [5]>
3 D20130101 <chr [5]> <list [5]>
4 D20130101 <chr [5]> <list [5]>
5 D20130101 <chr [5]> <list [5]>
6 D20130101 <chr [5]> <list [5]>
