About the .multicombine argument of foreach #rstatsj

Posted at 2014-12-05, last updated at 2014-12-05

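R's foreach() has a .multicombine argument.

According to

福島真太朗 (Shintaro Fukushima), 『Rによるハイパフォーマンスコンピューティング』

this argument is described as follows:

A flag indicating whether the function given as the .combine argument takes three or more arguments. If the function takes three or more arguments and this flag is not set to TRUE, only the first two arguments are used. (The default value is FALSE.)

That is what the book says.

However, the help page for the foreach function says the following.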

logical flag indicating whether the .combine function can accept more than two arguments. If an arbitrary .combine function is specified, by default, that function will always be called with two arguments. If it can take more than two arguments, then setting .multicombine to TRUE could improve the performance. The default value is FALSE unless the .combine function is cbind, rbind, or c, which are known to take more than two arguments.

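Loosely paraphrased:

.multicombine is a logical flag indicating whether the function specified by the .combine argument can accept three or more arguments. When some function is given as .combine, by default that function is only ever called with two arguments. If the function can accept three or more arguments, setting .multicombine to TRUE can be expected to improve performance. The default value of .multicombine is FALSE, except for cbind, rbind, and c: these three are known in advance to accept three or more arguments, so for them the default is TRUE.

From this we can see two things:

The .multicombine argument exists to improve performance
Its default value is FALSE, but it is TRUE for cbind, rbind, and c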
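To see this behavior directly, here is a minimal sketch that records how many arguments the .combine function receives on each call. The helper make_counting_combine() is a name I made up for illustration, and the loop uses %do% (sequential execution) so that the counter lives in the same R process. I am assuming the picture is similar with a parallel backend.

```r
library(foreach)

# Hypothetical helper (not part of foreach): builds a combine function
# that remembers how many arguments each call received.
make_counting_combine <- function() {
  calls <- integer(0)
  list(
    fun = function(...) {
      calls <<- c(calls, length(list(...)))  # record the argument count
      c(...)                                 # then combine like c()
    },
    calls = function() calls
  )
}

pairwise <- make_counting_combine()
res_pairwise <- foreach(i = 1:10, .combine = pairwise$fun,
                        .multicombine = FALSE) %do% i

multi <- make_counting_combine()
res_multi <- foreach(i = 1:10, .combine = multi$fun,
                     .multicombine = TRUE) %do% i

# Argument counts per call for each setting
print(pairwise$calls())
print(multi$calls())
```

With .multicombine=FALSE every recorded call should have exactly two arguments, as the help page describes. With TRUE the same results arrive in fewer, larger calls, and the final value is the same in both cases.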

So I checked whether setting .multicombine to TRUE really improves performance.

The test case comes from the same book,

福島真太朗 (Shintaro Fukushima), 『Rによるハイパフォーマンスコンピューティング』

namely its parallelized version of Random Forest (pp. 154-155).

library(doParallel)
library(randomForest)
library(kernlab)

data(spam)
cores <- 4

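# Time one parallel random forest fit for a given .multicombine setting
Execute <- function(.multicombine) {
  # Start one PSOCK worker per core and register it as the foreach backend
  cl <- makePSOCKcluster(cores)
  registerDoParallel(cl)
  
  start <- Sys.time()
  # Each task grows 250 trees; combine() merges the partial forests
  fit.rf <- foreach(ntree=rep(250, cores), .combine=combine, .export="spam",
                    .packages="randomForest", .multicombine=.multicombine) %dopar% {
    randomForest(type ~ ., data = spam, ntree = ntree)
  }
  end <- Sys.time()
  
  stopCluster(cl)
  (end - start)  # elapsed time of the foreach loop only
}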

Result

> Execute(.multicombine=FALSE)
Time difference of 12.18954 secs
> Execute(.multicombine=TRUE)
Time difference of 11.90909 secs

Here, the combine() function passed as the .combine argument is the randomForest function that merges randomForest objects. It can take three or more arguments, so it can be used with .multicombine=TRUE.
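As a quick check of that claim, here is a small sketch that calls combine() with three forests at once. It uses the built-in iris data instead of spam purely to keep it short, and the tree counts are arbitrary values I picked for illustration.

```r
library(randomForest)

set.seed(1)
# Three small forests grown on iris (a stand-in for the spam data)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 10)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 10)
rf3 <- randomForest(Species ~ ., data = iris, ntree = 10)

# combine() accepts any number of forests in a single call
rf_all <- randomForest::combine(rf1, rf2, rf3)

# The merged forest holds the trees of all three inputs
print(rf_all$ntree)
```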

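Comparing .multicombine=FALSE with .multicombine=TRUE, there is hardly any difference.
That is to be expected: .combine is the function used in the reduce phase, and with only four results to reduce (three pairwise combine calls) there is little to gain.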
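As a rough illustration of why the number of results matters, the sketch below mimics the reduce phase in base R, without foreach. The data are made up: 400 small matrices stand in for per-task results. Reduce() merges them two at a time, the way a two-argument .combine function is driven, while do.call() hands them all to rbind in a single call.

```r
# Made-up stand-ins for per-task results: 400 one-row integer matrices
pieces <- lapply(1:400, function(i) matrix(i, nrow = 1, ncol = 50))

# Two at a time: 399 rbind calls, each one copying the growing result
time_pairwise <- system.time(pairwise <- Reduce(rbind, pieces))

# All at once: a single rbind call with 400 arguments
time_at_once <- system.time(at_once <- do.call(rbind, pieces))

# Same result either way; only the amount of work differs
print(identical(pairwise, at_once))
print(c(pairwise = time_pairwise[["elapsed"]],
        at_once  = time_at_once[["elapsed"]]))
```

The timings depend on the machine, so treat them as indicative only. The point is that the pairwise version repeats the merge once per extra result, which only becomes noticeable when there are many results.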

So let's try again with more results to combine, changing ntree=rep(250, cores) to ntree=rep(25, cores * 10).

library(doParallel)
library(randomForest)
library(kernlab)

data(spam)
cores <- 4

Execute <- function(.multicombine) {
  cl <- makePSOCKcluster(cores)
  registerDoParallel(cl)
  
  start <- Sys.time()
  # 40 tasks of 25 trees each (still 1000 trees in total), so many more results to combine
  fit.rf <- foreach(ntree=rep(25, cores * 10), .combine=combine, .export="spam",
                    .packages="randomForest", .multicombine=.multicombine) %dopar% {
    randomForest(type ~ ., data = spam, ntree = ntree)
  }
  end <- Sys.time()
  
  stopCluster(cl)
  (end - start)
}

Result

> Execute(.multicombine=FALSE)
Time difference of 18.75828 secs
> Execute(.multicombine=TRUE)
Time difference of 13.57622 secs

Nice, .multicombine=TRUE came out faster this time!

So, in this measurement, using the .multicombine argument of foreach did improve performance.

That's all.
