Notes on R's summary function.

Posted at 2017-01-16, last updated at 2017-01-16

I read the following article:

http://qiita.com/imaizume/items/bef1d9324c9e1b980b3a

I did not understand why summary behaves the way that article describes, so I looked into it. These are my notes.

R code for reproduction

# Create the data set
df <- data.frame(x1 = rnorm(n = 100, mean = 0, sd = 1),
                 x2 = rnorm(n = 100, mean = 3, sd = 4))

# Pass the data frame straight to summary
df_summary <- summary(df)
df_summary
# >         x1                 x2         
# >   Min.   :-2.81552   Min.   :-5.5362  
# >   1st Qu.:-0.71615   1st Qu.: 0.1137  
# >   Median :-0.06726   Median : 2.8702  
# >   Mean   :-0.05051   Mean   : 3.0947  
# >   3rd Qu.: 0.58038   3rd Qu.: 5.9660  
# >   Max.   : 2.31044   Max.   :13.1016

# Call summary on each column through apply
df_summary_apply <- apply(df, 2, summary)
df_summary_apply
# >                x1      x2
# >  Min.    -2.81600 -5.5360
# >  1st Qu. -0.71610  0.1137
# >  Median  -0.06726  2.8700
# >  Mean    -0.05051  3.0950
# >  3rd Qu.  0.58040  5.9660
# >  Max.     2.31000 13.1000

The behavior reproduces.

Checking the class

summary is a generic function, so the processing depends on the class of its first argument.

class(df)
# >  [1] "data.frame"

No surprise there. In this case apply passes each column to summary (I think), so let's check the class of a column as well.

class(df$x1)
# >  [1] "numeric"

Again, as expected.
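The guess that apply hands each column to FUN as a plain numeric vector can be checked directly. The following is a minimal sketch, assuming a freshly generated data frame with an arbitrary seed (the seed and the variable name col_classes are not from the original):

```r
# Assumption: the seed is arbitrary and only makes the example repeatable.
set.seed(1)
df <- data.frame(x1 = rnorm(n = 100, mean = 0, sd = 1),
                 x2 = rnorm(n = 100, mean = 3, sd = 4))

# Ask apply() to report the class of whatever it passes to FUN for each column.
col_classes <- apply(df, 2, class)
print(col_classes)
```

Each column should be reported as "numeric", which is consistent with summary being dispatched on a numeric vector inside apply.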

Checking the methods of `summary`

When a data.frame is passed to summary, it is handled by summary.data.frame(). The key part of that code is probably around here:

https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/base/R/summary.R#L149-L177

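In the end, the result comes back with dimensions set and the class "table" attached. I see.

On the other hand, when a numeric vector is passed to summary, it is handled by summary.default(). The key part is probably around here: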

https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/base/R/summary.R#L39-L49

A named vector comes back. I see.
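The two dispatch paths described here can be compared side by side. This is only a sketch under the assumption of an arbitrary seed; the variable names s_df, s_vec, same_df, and same_vec are made up for illustration:

```r
# Assumption: arbitrary seed, same shape of data as in the reproduction code.
set.seed(1)
df <- data.frame(x1 = rnorm(n = 100, mean = 0, sd = 1),
                 x2 = rnorm(n = 100, mean = 3, sd = 4))

s_df  <- summary(df)     # first argument is a data.frame
s_vec <- summary(df$x1)  # first argument is a numeric vector

# Calling the methods directly should give the same objects as the generic.
same_df  <- identical(s_df, summary.data.frame(df))
same_vec <- identical(s_vec, summary.default(df$x1))

print(class(s_df))   # class of the data.frame summary
print(dim(s_df))     # it carries dimensions (rows of statistics x columns)
print(names(s_vec))  # the numeric summary is a named vector
```

The data.frame result is a two-dimensional "table" of formatted strings, while the numeric result is a short named vector, which is what makes the two outputs in the reproduction code look different.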

Checking `apply`

See ?apply for details. Here the MARGIN argument is 2, so the function given as FUN is applied to each column.

The Value section of the help document explains it as follows:

If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise. If n is 0, the result has length 0 but not necessarily the ‘correct’ dimension.

If the calls to FUN return vectors of different lengths, apply returns a list of length prod(dim(X)[MARGIN]) with dim set to MARGIN if this has length greater than one.

In all cases the result is coerced by as.vector to one of the basic vector types before the dimensions are set, so that (for example) factor results will be coerced to a character array.

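If no column contains NA, or if every column contains NA, the vectors that summary returns for the columns all have the same length n, so apply should return an array whose dimensions are c(n, dim(X)[MARGIN]). Here length(c(n, dim(X)[MARGIN])) is 2, so R automatically treats the result as a matrix. In addition, summary.default() returns a named vector, and its names (Min., Mean, and so on) become the row names (see the link in the previous section).

That is why df_summary_apply looks like the result shown above. I see.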
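The shape of the apply result can be inspected to confirm this reading of the help text. A minimal sketch, assuming an arbitrary seed and a hypothetical variable name res:

```r
# Assumption: arbitrary seed; no NA values in either column.
set.seed(1)
df <- data.frame(x1 = rnorm(n = 100, mean = 0, sd = 1),
                 x2 = rnorm(n = 100, mean = 3, sd = 4))

res <- apply(df, 2, summary)

print(is.matrix(res))  # a two-dimensional array is a matrix
print(dim(res))        # c(n, dim(X)[MARGIN]) with n = 6 statistics and 2 columns
print(rownames(res))   # taken from the names of the summary.default() result
print(colnames(res))   # taken from the column names of df
```

Unlike summary(df), which holds formatted strings, this matrix is still numeric, so its values can be used in further calculations.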

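Supplement

The article mentioned at the top says that "when NA is mixed in, the result comes out as a list". This happens because apply() processes each column with summary.default(), and summary.default() appends an NA count when the column contains NA (and omits it otherwise), so the returned vectors no longer have the same length. See the second paragraph of the help text quoted above.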
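The NA case can be reproduced with a small variation of the data. This is a sketch, assuming an arbitrary seed and a single NA placed in x2; the names df_na and res_na are made up for illustration:

```r
# Assumption: arbitrary seed; one NA is injected into x2 only.
set.seed(1)
df_na <- data.frame(x1 = rnorm(n = 100, mean = 0, sd = 1),
                    x2 = rnorm(n = 100, mean = 3, sd = 4))
df_na$x2[1] <- NA

res_na <- apply(df_na, 2, summary)

print(is.list(res_na))         # the lengths differ, so apply returns a list
print(sapply(res_na, length))  # x1 has 6 entries, x2 has one more for the NA count
print(names(res_na[[2]]))      # the extra entry is named "NA's"
```

Only the column with the NA gets the extra entry, which is exactly the situation the second paragraph of the Value section describes.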

When there is a character variable, the behavior is as follows:

# Create a character vector and check how summary behaves
x <- sample(x = c("a", "b", "c"), size = 100, replace = TRUE)
class(x)
# >  [1] "character"
summary(x)
# >     Length     Class      Mode 
# >        100 character character
summary(factor(x))
# >   a  b  c 
# >  39 30 31

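# Mix it into the data frame and check
# (output from R < 4.0, where cbind converts the character column to a factor)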
df2 <- cbind(df, x)
summary(df2)
# >         x1                 x2          x     
# >   Min.   :-2.81552   Min.   :-5.5362   a:39  
# >   1st Qu.:-0.71615   1st Qu.: 0.1137   b:30  
# >   Median :-0.06726   Median : 2.8702   c:31  
# >   Mean   :-0.05051   Mean   : 3.0947         
# >   3rd Qu.: 0.58038   3rd Qu.: 5.9660         
# >   Max.   : 2.31044   Max.   :13.1016
apply(df2, 2, summary)
# >         x1          x2          x          
# >  Length "100"       "100"       "100"      
# >  Class  "character" "character" "character"
# >  Mode   "character" "character" "character"

# Convert to factor explicitly and check
df3 <- cbind(df, factor(x))
summary(df3)
# >         x1                 x2          factor(x)
# >   Min.   :-2.81552   Min.   :-5.5362   a:39     
# >   1st Qu.:-0.71615   1st Qu.: 0.1137   b:30     
# >   Median :-0.06726   Median : 2.8702   c:31     
# >   Mean   :-0.05051   Mean   : 3.0947            
# >   3rd Qu.: 0.58038   3rd Qu.: 5.9660            
# >   Max.   : 2.31044   Max.   :13.1016
apply(df3, 2, summary)
# >         x1          x2          factor(x)  
# >  Length "100"       "100"       "100"      
# >  Class  "character" "character" "character"
# >  Mode   "character" "character" "character"

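When the data.frame is processed as is, the behavior is the usual one. As for why x1 and x2 are also treated like character vectors when apply is used, I think it is because apply first tries to coerce a data.frame to a matrix with as.matrix.

The Details section of the apply help document says: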

If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix if it is two-dimensional (e.g., a data frame) or via as.array.

Coercing the data frame to a matrix by hand gives the following:

str(as.matrix(df2))
# >   chr [1:100, 1:3] " 1.941737947" " 0.576089467" " 0.205150425" ...
# >   - attr(*, "dimnames")=List of 2
# >    ..$ : NULL
# >    ..$ : chr [1:3] "x1" "x2" "x"

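Everything has become a character string. An R matrix is just a vector with a dimension attribute, and a vector can hold only one type. So when numeric and character columns are squeezed into a single matrix, everything has to be coerced to character.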
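The coercion can be seen by comparing the storage type of the matrix with and without the character column. The sketch below assumes an arbitrary seed; the last two lines show one possible workaround (restricting to numeric columns before apply), which is my own assumption and not something the original article proposes:

```r
# Assumption: arbitrary seed, same shape of data as in the examples above.
set.seed(1)
df <- data.frame(x1 = rnorm(n = 100, mean = 0, sd = 1),
                 x2 = rnorm(n = 100, mean = 3, sd = 4))
x <- sample(x = c("a", "b", "c"), size = 100, replace = TRUE)
df2 <- cbind(df, x)

type_numeric_only <- typeof(as.matrix(df))   # numeric columns only
type_mixed        <- typeof(as.matrix(df2))  # one non-numeric column added
print(c(type_numeric_only, type_mixed))

# Hypothetical workaround: keep only the numeric columns, then apply summary.
num_only <- df2[sapply(df2, is.numeric)]
res_num  <- apply(num_only, 2, summary)
print(dim(res_num))
```

A single non-numeric column is enough to turn the whole matrix into character, which is why x1 and x2 lose their numeric summaries in apply(df2, 2, summary).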

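Thoughts

I think this is what is going on, but I would appreciate corrections if I have anything wrong.

Also, summary is a hassle. For this kind of use I would personally rather use dplyr::summarize.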

Enjoy!
