
[R] Measuring NbClust computation time for each index

Posted at 2020-11-28, last updated at 2020-11-28

Background

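NbClust is an R library that suggests the number of clusters to use in a clustering. By default its index argument is set to index='all', which computes every index except GAP, Gamma, Gplus, and Tau. These four are excluded because they take a long time to compute, but in practice the default is still slow.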

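However, if you compute just a single index, for example with index='db', the computation finishes almost immediately.
That made me suspect that some indices other than those four are also expensive to compute, so I looked into it.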
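The difference described above can be checked directly with a rough sketch like the following, which times one run with index='db' and one with index='all' on iris. It assumes the NbClust package is installed and is skipped otherwise; the variable names are illustrative only, and the timings will vary by machine.

```r
# Sketch: compare a single index against the default index = "all" on iris.
# Assumes the NbClust package is installed; does nothing if it is not.
if (requireNamespace("NbClust", quietly = TRUE)) {
  library(NbClust)
  x <- iris[, -5]  # drop the Species column, keep the four numeric columns

  # system.time() returns user, system and elapsed seconds for the expression
  t_db  <- system.time(res_db  <- NbClust(x, method = "ward.D", index = "db"))
  t_all <- system.time(res_all <- NbClust(x, method = "ward.D", index = "all"))

  cat("db :", t_db[["elapsed"]], "s\n")
  cat("all:", t_all[["elapsed"]], "s\n")

  # Number of clusters suggested by the Davies-Bouldin index alone
  print(res_db$Best.nc)
}
```

Note that index = "all" also prints a summary and draws plots for some indices, so part of its elapsed time is output rather than index computation.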

Measurement method and results

For each index, I measured the time taken to cluster the iris dataset with Ward's method. The code I ran is as follows.

library(NbClust)

# All 30 index names passed to NbClust's `index` argument, one at a time
index_ary <- c("kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew",
               "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2",
               "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain",
               "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw")

# Time one NbClust run per index on iris (Species column dropped)
for (index in index_ary) {
  start.time <- proc.time()
  NbClust(iris[, -5], method = "ward.D", index = index)
  end.time <- proc.time()
  # The third element of the proc.time difference is elapsed (wall-clock) seconds
  cat(index, ' ', (end.time - start.time)[3], '\n')
}
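A single run per index, as in the code above, can be noisy for timings of a few hundredths of a second. One possible refinement, shown here only as a sketch, is to repeat each measurement and take the median. The helper median_elapsed is hypothetical (it is not part of NbClust), and the three indices in the loop are an arbitrary illustration; the NbClust part runs only if the package is installed.

```r
# Hypothetical helper (not part of NbClust): call `f` `reps` times and
# return the median elapsed (wall-clock) seconds across the runs.
median_elapsed <- function(f, reps = 5) {
  times <- vapply(
    seq_len(reps),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  median(times)
}

# Usage with NbClust, assuming the package is installed
if (requireNamespace("NbClust", quietly = TRUE)) {
  x <- iris[, -5]
  for (idx in c("db", "silhouette", "sdbw")) {
    t <- median_elapsed(function() NbClust::NbClust(x, method = "ward.D", index = idx))
    cat(sprintf("%-12s %.3f\n", idx, t))
  }
}
```

The median is used rather than the mean so that one unusually slow run (for example, the first call after loading the package) does not dominate the result.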

The results were as follows.

Index	Elapsed time (s)	Notes
kl	0.033
ch	0.024
hartigan	0.023
ccc	0.01
scott	0.01
marriot	0.012
trcovw	0.011
tracew	0.01
friedman	0.01
rubin	0.01
cindex	0.009
db	0.014
silhouette	0.105	Somewhat slow
duda	0.02
pseudot2	0.023
beale	0.019
ratkowsky	0.016
ball	0.013
ptbiserial	0.236	Somewhat slow
gap	0.272	Somewhat slow
frey	0.043
mcclain	0.046
gamma	3.075	Slow
gplus	3.046	Slow
tau	3.031	Slow
dunn	0.013
hubert	0.268	Somewhat slow
sdindex	0.034
dindex	0.072
sdbw	0.283	Somewhat slow

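All four indices excluded by index='all' turned out to be slow, and gamma, gplus, and tau are especially slow at about 3 seconds each, compared with about 0.27 seconds for gap.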

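And here is the important point: apart from the four excluded indices, silhouette, ptbiserial, hubert, and sdbw are also on the slow side. That said, at roughly 0.1 to 0.3 seconds each, they take only about one-tenth or less of the time of gamma and the other slowest indices.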
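To make the "about one-tenth" comparison concrete, the following small sketch divides the measured times of the slower indices by the time for gamma. The numbers are copied from the results table; the variable names are illustrative only.

```r
# Elapsed seconds copied from the results table for the slower indices
elapsed <- c(
  silhouette = 0.105, ptbiserial = 0.236, gap = 0.272,
  hubert = 0.268, sdbw = 0.283,
  gamma = 3.075, gplus = 3.046, tau = 3.031
)

# Each index's time as a fraction of the time for gamma
ratio_to_gamma <- elapsed / elapsed[["gamma"]]
print(round(ratio_to_gamma, 3))
```

On these figures, silhouette, ptbiserial, hubert, and sdbw come out at roughly 3% to 9% of the gamma time, which matches the rough one-tenth estimate.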

Summary

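Measuring NbClust's execution time for each index showed that, besides the indices excluded by index='all', several other indices are also on the slow side. However, they take only about one-tenth of the time of the indices excluded by index='all'.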
