
Is R's do.call() a classical higher-order function? Learning how to use it

Posted at 2016-10-08 (last updated at 2016-10-08)

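While browsing Kaggle kernels, where participants' code is shared, I came across R code that makes heavy use of do.call(). I had hardly seen do.call() before, so I looked it up; it turns out to be a fairly classical function and not difficult to use. The notes below are for my own future reference.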

Overview of do.call()

First, a quotation from the CRAN manual.

do.call - Execute a Function Call

Description

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.

Usage
do.call(what, args, quote = FALSE, envir = parent.frame())

Arguments

what either a function or a non-empty character string naming the function to be called.

args a list of arguments to the function call. The names attribute of args gives the argument names.

quote a logical value indicating whether to quote the arguments.

envir an environment within which to evaluate the call. This will be most useful if what is a character string and the arguments are symbols or quoted expressions.

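Functionally, it is exactly what the name says: a function call. R has a rich apply family of functions, which is probably better known, but do.call() also seems to be used where it fits. It takes four arguments as shown above, of which only the first two are required: "what", the function object (or function name) to call, and "args", the arguments to pass to it. "args" must be a list, and each element of that list becomes one argument of a single call.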
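To make the calling convention concrete, here is a minimal sketch. The values and the name `args_list` are made up for illustration and do not come from the manual. It shows three things: list elements become the arguments of one call, names in the list become argument names, and "what" may be given as a character string.

```r
# Each list element becomes one argument of a single call:
# do.call(sum, list(1, 2, 3)) is the same as sum(1, 2, 3).
total <- do.call(sum, list(1, 2, 3))

# Names in the list are used as argument names:
# this is the same as paste("a", "b", sep = "-").
args_list <- list("a", "b", sep = "-")   # hypothetical example arguments
joined <- do.call(paste, args_list)

# "what" may also be a character string naming the function.
largest <- do.call("max", list(4, 9, 2))

print(total)    # 6
print(joined)   # "a-b"
print(largest)  # 9
```

This is most useful when the arguments are built at run time and their number is not known when the code is written.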

Here are a few usage examples.

First, define a function.

# define my own function
myrange <- function (larg) {
    nv <- unlist(larg)
    rg <- max(nv) - min(nv)
    return(rg)
}

Here I use "iris", a dataset that is readily available in R.

# Data.Frame example
head(iris)

Table 1. Iris Dataset

Call the defined function "myrange" through do.call().

do.call(myrange, list(iris$Sepal.Length))
# Out: 3.6

As expected, the maximum minus the minimum of Sepal.Length (3.6) is printed.
As a check, R's built-in range() returns 4.3 and 7.9 (minimum and maximum), so the result agrees with 3.6 (= 7.9 - 4.3).
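The check against range() can also be written out in code. The following is a small sketch of my own rather than part of the original session. It redefines myrange so that it runs on its own, and it also shows that with a one-element list do.call() is equivalent to an ordinary call.

```r
# Same function as defined above, repeated so this block is self-contained
myrange <- function(larg) {
    nv <- unlist(larg)
    max(nv) - min(nv)
}

x <- iris$Sepal.Length

via_do_call <- do.call(myrange, list(x))  # one list element -> one argument
direct      <- myrange(x)                 # ordinary call, same result
builtin     <- diff(range(x))             # max - min via the built-in range()

print(c(via_do_call, direct, builtin))
print(isTRUE(all.equal(via_do_call, builtin)))
```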

Let's check one more example. First, prepare a function that normalizes numeric values; then prepare sample input data and run do.call() as follows.

normalize <- function(x, m = mean(x), s = sd(x)) {
    (x - m) / s
}

myseq <- list(c(1, 3, 6, 10, 15))
do.call(normalize, myseq)

# -1.0690449676497 -0.712696645099798 -0.17817416127495 0.534522483824849 1.4253932901996

The mean and standard deviation of the output values are

mean of normalized =
[1] -5.572799e-18
standard deviation = 
[1] 1

that is, approximately 0 and 1, which shows that the normalization worked as expected.
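Because normalize() has default arguments, it is also a convenient place to see how a named list interacts with do.call(). The sketch below is my own illustration; the override values m = 0 and s = 10 are arbitrary assumptions chosen to make the result easy to read.

```r
normalize <- function(x, m = mean(x), s = sd(x)) {
    (x - m) / s
}

x <- c(1, 3, 6, 10, 15)

# Unnamed single element: m and s fall back to their defaults.
z_default <- do.call(normalize, list(x))

# Named elements override the defaults;
# this is the same as normalize(x, m = 0, s = 10).
z_scaled <- do.call(normalize, list(x = x, m = 0, s = 10))

print(round(z_default, 4))
print(z_scaled)  # 0.1 0.3 0.6 1.0 1.5
print(c(mean = mean(z_default), sd = sd(z_default)))
```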

Comparison with Python Pandas apply()

Python's built-in map() may be closer to R's do.call(), but I personally do not use it much, so here I compare against Pandas apply() instead. (Reference: "Python for Data Analysis", O'Reilly Media)
First, prepare sample data.

# Sample Data
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Table 2. Data Example

Prepare a function that computes the range (maximum - minimum) and apply() it to the pd.DataFrame.

# define lambda function
f = lambda x: x.max() - x.min()
frame[['d']].apply(f)
# frame['d'].apply(f) raises an error: Series.apply() calls f on each scalar element, not on the whole column

This behaves as expected.

Out: d    4.016529
dtype: float64

To select the column by position instead, use iloc[] as follows.

frame.iloc[:, [2]].apply(f)

# Out: e    2.160329
# dtype: float64

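Note that, because we want to pass the whole sequence to the function, the column must be specified with a list, as in frame[['d']] or frame.iloc[:, [2]]. (With frame['d'] or frame.iloc[:, 2], the result is a pd.Series, whose apply() calls the function on each scalar element, so an error is raised.)

This reproduces the behavior of the R do.call() example above.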

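Summary

do.call() is a function I do not see often (maybe it's just me?), but it seems to be used in situations such as "process a data.frame and then combine the results". That said, the apply family looks more convenient, and do.call() appears to be a "classical" way of writing things. Personally I do not plan to use do.call() actively, but when I see it in other people's code I want to understand it properly without panicking.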
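A common instance of the "process a data.frame, then combine" pattern is do.call(rbind, ...) applied to a list of per-group results. The sketch below is my own illustration using iris, not code taken from a Kaggle kernel; `summarise_one`, `per_species`, and `combined` are hypothetical names.

```r
# Summarise one piece of iris as a one-row data.frame (hypothetical helper).
summarise_one <- function(df) {
    data.frame(
        Species     = as.character(df$Species[1]),
        n           = nrow(df),
        sepal_range = max(df$Sepal.Length) - min(df$Sepal.Length)
    )
}

# Split iris by species and summarise each piece with lapply().
per_species <- lapply(split(iris, iris$Species), summarise_one)

# per_species is a list of three one-row data.frames.
# do.call(rbind, per_species) is the same as
# rbind(per_species[[1]], per_species[[2]], per_species[[3]]).
combined <- do.call(rbind, per_species)

print(combined)
```

Here lapply() does the per-element work, while do.call() hands the whole list to a single rbind() call, so the two play complementary roles rather than competing ones.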

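I did not find a Python function that corresponds by name to do.call(), although argument unpacking (f(*args, **kwargs)) plays the same role of passing a list as the arguments of one call. For the computations in this article, Pandas apply(), or a list comprehension over the unpacked data, should achieve the desired behavior.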

(Environment: R ver. 3.3.1 and Python ver. 3.5.2, both on Jupyter Notebook.)

References

R: A Language and Environment for Statistical Computing - CRAN
https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf
Learning R - O'Reilly Media
http://shop.oreilly.com/product/0636920028352.do
Python for Data Analysis - O'Reilly Media
http://shop.oreilly.com/product/0636920023784.do
Functions that may come in handy someday (?) (No. 49) - RjpWiki
http://www.okadajp.org/RWiki/?%E7%9F%A5%E3%81%A3%E3%81%A6%E3%81%84%E3%82%8B%E3%81%A8%E3%81%84%E3%81%A4%E3%81%8B%E5%BD%B9%E3%81%AB%E7%AB%8B%E3%81%A4%28%3F%29%E9%96%A2%E6%95%B0%E9%81%94
