
Association analysis in R

Last updated at 2019-09-05 Posted at 2020-08-14

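What is association analysis?

Association analysis is a technique for extracting the relationships among items latent in data as association rules.
A rule is written in the form {A ⇒ B}, where A is called the antecedent and B the consequent.
Each rule has evaluation metrics that express its importance. Rules are extracted by specifying conditions on these metrics, typically as thresholds, and keeping only the rules that satisfy them. The three most commonly used metrics, $support$, $confidence$, and $lift$, are shown below.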

$$support = P(A \cap B) = \dfrac{N(A \cap B)}{N_{r}}$$
$$confidence = P(B | A) = \dfrac{N(A \cap B)}{N(A)}$$
$$lift= \dfrac{P(B | A)}{P(B)}$$

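$N_{r}$ is the total number of records, and $N(A)$ is the number of records that satisfy condition A.
For example, in the survey data below, the rule {(Gender = Male) & (Age = 20s) ⇒ (Rating of product A = Like)} has $support = 2/5 = 0.4$, $confidence = 2/2 = 1.0$, and $lift = (2/2)/(2/5) = 2.5$.

Respondent No.	Gender	Age	Rating of product A
1	Male	20s	Like
2	Female	20s	Dislike
3	Male	40s	Dislike
4	Male	20s	Like
5	Female	50s	Neutral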
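
As a minimal sketch, the three metrics for this example can be reproduced in base R. The data frame `survey` and its English labels are illustrative names introduced here; they stand in for the five-respondent table above.

```r
# Toy survey data: the five respondents from the table above
# (variable names and English labels are illustrative)
survey <- data.frame(
  gender = c("male", "female", "male", "male", "female"),
  age    = c("20s", "20s", "40s", "20s", "50s"),
  rating = c("like", "dislike", "dislike", "like", "neutral")
)

# Antecedent A: male and in his 20s; consequent B: likes product A
A <- survey$gender == "male" & survey$age == "20s"
B <- survey$rating == "like"

support    <- sum(A & B) / nrow(survey)  # N(A and B) / N_r
confidence <- sum(A & B) / sum(A)        # N(A and B) / N(A)
lift       <- confidence / mean(B)       # P(B | A) / P(B)

print(c(support = support, confidence = confidence, lift = lift))
```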

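In general, rules are extracted using the three metrics above, but there is no clear criterion for what values they must take for a rule to be judged important. For this reason, the $\chi^{2}$ statistic from statistics is sometimes used as an evaluation metric instead.
For {A ⇒ B}, consider the following contingency table.

	B	not B
A	a	b
not A	c	d

Each cell is the number of records that satisfy both its row label and its column label, so $a = N(A \cap B)$ and $a + b + c + d = N_{r}$. For this table, the $\chi^{2}$ statistic is given by the following formula.

$$\chi^{2} = \dfrac{N_{r}(ad-bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

$\chi^{2}$ measures the strength of the association between A and B, and when A and B are independent it is known to follow (approximately) a $\chi^{2}$ distribution with 1 degree of freedom. This makes it possible to use the significance levels of statistics as a guide for thresholds in rule extraction. For example, a significance level of 5% corresponds to $\chi^{2} > 3.84$, and 1% to $\chi^{2} > 6.63$.
Note: the value is used only as a threshold; this does not amount to testing whether a significant difference exists.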
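
The formula and the two thresholds can be checked with a short base R sketch. The helper name `chi2_2x2` is hypothetical, and the cell counts are those of the five-respondent survey example (A = male in his 20s, B = likes product A). Five records are far too few for the $\chi^{2}$ approximation to be meaningful, so this only illustrates the arithmetic.

```r
# Chi-squared statistic of a 2x2 contingency table with cells a, b, c, d
# (chi2_2x2 is a hypothetical helper name)
chi2_2x2 <- function(a, b, c, d) {
  n <- a + b + c + d  # N_r
  n * (a * d - b * c)^2 / ((a + b) * (c + d) * (a + c) * (b + d))
}

# Survey example: a = 2 (A and B), b = 0 (A, not B), c = 0 (not A, B), d = 3 (neither)
stat <- chi2_2x2(2, 0, 0, 3)

# Thresholds for significance levels of 5% and 1% with 1 degree of freedom
thresholds <- qchisq(c(0.95, 0.99), df = 1)

print(stat)
print(round(thresholds, 2))
print(stat > thresholds)  # does the rule pass each threshold?
```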

Association analysis in R ①

Environment

>ver

Microsoft Windows [Version 10.0.17763.678]

> version
               _                           
platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.3                         
year           2019                        
month          03                          
day            11                          
svn rev        76217                       
language       R                           
version.string R version 3.5.3 (2019-03-11)
nickname       Great Truth

Data used

Kaggle's Titanic dataset
  Titanic: Machine Learning from Disaster

Experiment

By extracting rules whose consequent is Survived=1 or Survived=0, we identify what kinds of passengers survived and what kinds did not.

Package installation

install.packages('arules', dependencies=TRUE)
library(arules)

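Loading the data

# stringsAsFactors=TRUE is the default in R 3.5.3 but must be set explicitly in R 4.0 and later
data<-read.csv('xxx/train.csv', fileEncoding ='utf8', stringsAsFactors=TRUE)

# Use only a subset of the columns this time
data<-data[, c('Survived', 'Pclass', 'Sex', 'Age', 'SibSp','Parch', 'Fare', 'Embarked')]

# Convert to appropriate data types
data$Survived<-as.factor(data$Survived)
data$Pclass<-as.factor(data$Pclass)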

str(data)

'data.frame':	891 obs. of  8 variables:
 $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

Association analysis

# Association analysis cannot handle numeric variables directly, so use only the categorical variables for now
data.1<-data[, c('Survived', 'Pclass', 'Sex', 'Embarked')]

# Extract rules
rules<-apriori(data.1, parameter=list(supp=0.01, maxlen=4))

Apriori

Parameter specification:

Algorithmic control:

Absolute minimum support count: 8 

set item appearances ...[0 item(s)] done [0.01s].
set transactions ...[11 item(s), 891 transaction(s)] done [0.00s].
sorting and recoding items ... [10 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [50 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

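Above, rules were extracted under conditions on $support$ and on rule length (the total number of items in the antecedent and the consequent); the minimum $confidence$ was left at the apriori default of 0.8. Next, the extracted rules are filtered according to the goal. Since the question here is what kinds of passengers survived or did not survive, we keep the rules whose consequent is Survived=1 or Survived=0 and display them in descending order of $lift$.

# Filter the rules
subrules<-subset(rules, subset=rhs %in% c('Survived=0', 'Survived=1'))

# Display the rules
inspect(head(sort(subrules, by='lift'), n=20))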

     lhs                                 rhs          support    confidence lift     count
[1]  {Pclass=1,Sex=female,Embarked=C} => {Survived=1} 0.04713805 0.9767442  2.544676  42  
[2]  {Pclass=1,Sex=female}            => {Survived=1} 0.10213244 0.9680851  2.522116  91  
[3]  {Pclass=1,Sex=female,Embarked=S} => {Survived=1} 0.05162738 0.9583333  2.496711  46  
[4]  {Pclass=2,Sex=female}            => {Survived=1} 0.07856341 0.9210526  2.399584  70  
[5]  {Pclass=2,Sex=female,Embarked=S} => {Survived=1} 0.06846240 0.9104478  2.371956  61  
[6]  {Sex=female,Embarked=C}          => {Survived=1} 0.07182941 0.8767123  2.284066  64  
[7]  {Sex=male,Embarked=Q}            => {Survived=0} 0.04264871 0.9268293  1.504198  38  
[8]  {Pclass=3,Sex=male,Embarked=Q}   => {Survived=0} 0.04040404 0.9230769  1.498108  36  
[9]  {Pclass=3,Sex=male,Embarked=S}   => {Survived=0} 0.25925926 0.8716981  1.414723 231  
[10] {Pclass=3,Sex=male}              => {Survived=0} 0.33670034 0.8645533  1.403128 300  
[11] {Pclass=2,Sex=male,Embarked=S}   => {Survived=0} 0.09203143 0.8453608  1.371979  82  
[12] {Pclass=2,Sex=male}              => {Survived=0} 0.10213244 0.8425926  1.367486  91  
[13] {Sex=male,Embarked=S}            => {Survived=0} 0.40852974 0.8253968  1.339578 364  
[14] {Sex=male}                       => {Survived=0} 0.52525253 0.8110919  1.316362 468  
[15] {Pclass=3,Embarked=S}            => {Survived=0} 0.32098765 0.8101983  1.314912 286

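Rules [2] and [4] show that more than 90% of women travelling in 1st or 2nd class survived (more than twice the overall survival rate). Rule [7] is also interesting: more than 90% of men who embarked at Queenstown did not survive (more than 1.5 times the overall rate of non-survival).
As shown here, association analysis makes it easy to grasp trends in data and to segment it.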
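
The comparison with the overall average can be read back from the rule table itself. Because $lift$ is $confidence$ divided by the overall probability of the consequent, dividing $confidence$ by $lift$ recovers that overall rate. The sketch below copies three rows from the output above into a data frame (the name `rules_tbl` is illustrative); it suggests an overall survival rate of roughly 38% and a non-survival rate of roughly 62%.

```r
# Confidence and lift copied from the rule table above (rules [2], [4], and [7])
rules_tbl <- data.frame(
  rule       = c("[2]", "[4]", "[7]"),
  rhs        = c("Survived=1", "Survived=1", "Survived=0"),
  confidence = c(0.9680851, 0.9210526, 0.9268293),
  lift       = c(2.522116, 2.399584, 1.504198)
)

# lift = confidence / P(rhs), so P(rhs) = confidence / lift
rules_tbl$base_rate <- rules_tbl$confidence / rules_tbl$lift

print(rules_tbl)
```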

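Binning numeric variables

Association analysis cannot handle numeric variables directly, but they can be used once they are binned and converted to categorical variables.
There are many binning methods. Simple ones include:

- Splitting at specified boundary values (often chosen from domain knowledge or from visualizing the data)
- Splitting at quantiles

Since the target here is fixed as Survived, this article instead tries a supervised method called WoE binning.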
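
For reference, the two simple methods can be sketched with base R's `cut()` and `quantile()`. The input is the first ten Age values shown in the `str(data)` output earlier; the boundary values 19 and 40 are arbitrary choices made only for illustration.

```r
# First ten Age values as shown in the str(data) output (one of them is missing)
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# (a) Split at specified boundary values; 19 and 40 are arbitrary illustrative choices
by_boundary <- cut(age, breaks = c(-Inf, 19, 40, Inf))

# (b) Split at quartiles; include.lowest = TRUE keeps the minimum value in the first bin
q <- quantile(age, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
by_quantile <- cut(age, breaks = q, include.lowest = TRUE)

print(table(by_boundary, useNA = "ifany"))
print(table(by_quantile, useNA = "ifany"))
```

Both calls leave the missing value as NA, so a separate category for missing values would have to be added by hand if it is needed.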

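Flow of WoE binning (as implemented in the R package)

1. Bin the data so that each bin contains roughly the same number of records.
2. Merge the pair of adjacent bins whose WoE (Weight of Evidence) values are closest.
3. If the resulting decrease in IV (Information Value) is within a given percentage, accept the merge and return to step 2; otherwise reject the merge and stop.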

$$WoE_{i} =\ln \dfrac{\dfrac{N(Event|i)}{N(Event)}}{\dfrac{N(non Event|i)}{N(non Event)}} = \ln \dfrac{P(Event|i)}{P(non Event|i)} - \ln \dfrac{P(Event)}{P(non Event)}$$

$$IV = \sum_{i=1}^k (\dfrac{N(Event|i)}{N(Event)}-\dfrac{N(non Event|i)}{N(non Event)}) \times WoE_{i} $$

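$Event$ and $nonEvent$ are the two values of the target; here they correspond to survived and did not survive. $k$ is the number of bins.
WoE expresses how much the log odds of the target within a bin are raised or lowered relative to the log odds over the whole data set.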
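
The merging steps and the two formulas can be sketched in base R as follows. This is a simplified illustration, not the package's actual implementation: it starts from per-bin counts rather than raw data, assumes every count is positive, and measures the IV decrease relative to the previous step. The function names, the `stop_limit` value of 0.1, and the input counts are all assumptions made for this example.

```r
# WoE per bin and total IV from per-bin event / non-event counts
# (all counts are assumed to be positive)
woe_iv <- function(ev, non) {
  p_ev  <- ev / sum(ev)     # N(Event | i) / N(Event)
  p_non <- non / sum(non)   # N(nonEvent | i) / N(nonEvent)
  woe   <- log(p_ev / p_non)
  list(woe = woe, iv = sum((p_ev - p_non) * woe))
}

# Merge bins j and j + 1 of a count vector
merge_pair <- function(x, j) {
  c(x[seq_len(j - 1)], x[j] + x[j + 1], x[-seq_len(j + 1)])
}

# Steps 2 and 3: repeatedly merge the adjacent pair with the closest WoE,
# stopping when a merge would reduce IV by more than stop_limit
merge_bins <- function(ev, non, stop_limit = 0.1) {
  while (length(ev) > 2) {
    cur <- woe_iv(ev, non)
    j <- which.min(abs(diff(cur$woe)))  # adjacent pair (j, j + 1) with the closest WoE
    ev_new  <- merge_pair(ev, j)
    non_new <- merge_pair(non, j)
    iv_drop <- (cur$iv - woe_iv(ev_new, non_new)$iv) / cur$iv
    if (iv_drop > stop_limit) break     # reject this merge and stop
    ev  <- ev_new
    non <- non_new
  }
  c(list(event = ev, non_event = non), woe_iv(ev, non))
}

# Hypothetical counts for five initial bins
result <- merge_bins(ev = c(10, 12, 30, 35, 60), non = c(60, 55, 40, 38, 15))
print(result)
```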

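Association analysis in R ②

WoE Binning

# Load the woeBinning package, which provides woe.binning() and woe.binning.deploy()
# (install it first with install.packages('woeBinning') if necessary)
library(woeBinning)

# Bin the numeric variables
col<-c('Age', 'SibSp', 'Parch', 'Fare')
bin<-woe.binning(data, target.var='Survived', pred.var=col)

# Apply the binning result to the original data
data.2<-woe.binning.deploy(data, bin)
data.2<-data.2[, !(colnames(data.2) %in% col)]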

str(data.2)

'data.frame':	891 obs. of  8 variables:
 $ Survived    : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass      : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Sex         : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Embarked    : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
 $ Fare.binned : Factor w/ 4 levels "(-Inf,10.5]",..: 1 3 1 2 1 1 2 2 2 2 ...
 $ SibSp.binned: Factor w/ 4 levels "(-Inf,0]","(0,1]",..: 2 2 1 2 1 1 1 3 1 2 ...
 $ Parch.binned: Factor w/ 3 levels "(-Inf,0]","(0, Inf]",..: 1 1 1 1 1 1 1 2 2 1 ...
 $ Age.binned  : Factor w/ 4 levels "(-Inf,19]","(19,22]",..: 2 3 3 3 3 4 3 1 3 1 ...

Inspecting the binning result shows that, for example, the cut points for Age are 19 and 22.

print(bin)
print(bin[,2])

     [,1]    [,2]   [,3]      
[1,] "Fare"  List,9 0.5235679 
[2,] "SibSp" List,9 0.132211  
[3,] "Parch" List,9 0.08971808
[4,] "Age"   List,9 0.07727167
[[1]]
                     woe cutpoints.final cutpoints.final[-1] iv.total.final   1
(-Inf,10.5]    -85.54612            -Inf                10.5      0.5235679  76
(10.5,56.4958]  24.67601         10.5000             56.4958      0.5235679 177
(56.4958, Inf] 127.30446         56.4958                 Inf      0.5235679  89
Missing               NA             Inf             Missing      0.5235679   0
                 0 col.perc.a col.perc.b   iv.bins
(-Inf,10.5]    287  0.2222222 0.52276867 0.2571058
(10.5,56.4958] 222  0.5175439 0.40437158 0.0279264
(56.4958, Inf]  40  0.2602339 0.07285974 0.2385357
Missing          0  0.0000000 0.00000000        NA

[[2]]
               woe cutpoints.final cutpoints.final[-1] iv.total.final   1   0
(-Inf,0] -16.60568            -Inf                   0       0.132211 210 398
(0,1]     61.70756               0                   1       0.132211 112  97
(1, Inf] -51.99641               1                 Inf       0.132211  20  54
Missing         NA             Inf             Missing       0.132211   0   0
         col.perc.a col.perc.b    iv.bins
(-Inf,0] 0.61403509 0.72495446 0.01841891
(0,1]    0.32748538 0.17668488 0.09305531
(1, Inf] 0.05847953 0.09836066 0.02073675
Missing  0.00000000 0.00000000         NA

[[3]]
               woe cutpoints.final cutpoints.final[-1] iv.total.final   1   0
(-Inf,0] -17.37481            -Inf                   0     0.08971808 233 445
(0, Inf]  52.02447               0                 Inf     0.08971808 109 104
Missing         NA             Inf             Missing     0.08971808   0   0
         col.perc.a col.perc.b    iv.bins
(-Inf,0]  0.6812865  0.8105647 0.02246183
(0, Inf]  0.3187135  0.1894353 0.06725625
Missing   0.0000000  0.0000000         NA

[[4]]
                 woe cutpoints.final cutpoints.final[-1] iv.total.final   1   0
(-Inf,19]  40.008430            -Inf                  19     0.07727167  79  85
(19,22]   -45.347433              19                  22     0.07727167  19  48
(22, Inf]   5.745981              22                 Inf     0.07727167 192 291
Missing   -40.378231             Inf             Missing     0.07727167  52 125
          col.perc.a col.perc.b    iv.bins
(-Inf,19] 0.23099415 0.15482696 0.03047330
(19,22]   0.05555556 0.08743169 0.01445501
(22, Inf] 0.56140351 0.53005464 0.00180130
Missing   0.15204678 0.22768670 0.03054206
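
As a sanity check, the Fare table (element [[1]]) can be recomputed from its counts using the WoE and IV definitions given earlier. The sketch below uses only base R and the counts printed in columns "1" and "0"; judging from the numbers, the printed woe column appears to be scaled by 100, which is an inference from this output rather than something the article states.

```r
# Counts per Fare bin, copied from the printed table
# (columns "1" and "0"; the empty Missing bin is omitted)
survived     <- c(76, 177, 89)
not_survived <- c(287, 222, 40)

col_perc_a <- survived / sum(survived)          # N(Event | i) / N(Event)
col_perc_b <- not_survived / sum(not_survived)  # N(nonEvent | i) / N(nonEvent)

woe     <- log(col_perc_a / col_perc_b)
iv_bins <- (col_perc_a - col_perc_b) * woe

print(round(100 * woe, 5))     # compare with the woe column
print(round(iv_bins, 7))       # compare with the iv.bins column
print(round(sum(iv_bins), 7))  # compare with iv.total.final
```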

Association analysis

# Extract rules
rules<-apriori(data.2, parameter=list(supp=0.01, maxlen=4))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target
        0.8    0.1    1 none FALSE            TRUE       5    0.01      1      4  rules
   ext
 FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 8 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[23 item(s), 891 transaction(s)] done [0.00s].
sorting and recoding items ... [22 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [1413 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
Warning message:
In apriori(data.2, parameter = list(supp = 0.01, maxlen = 4)) :
  Mining stopped (maxlen reached). Only patterns up to a length of 4 returned!

Filter and display the rules under the same conditions as in ①.

# Filter the rules
subrules<-subset(rules, subset=rhs %in% c('Survived=0', 'Survived=1'))

# Display the rules
inspect(head(sort(subrules, by='lift'), n=20))

     lhs                             rhs             support confidence     lift count
[1]  {Sex=female,                                                                     
      Embarked=C,                                                                     
      Fare.binned=(56.4958, Inf]} => {Survived=1} 0.04152637  1.0000000 2.605263    37
[2]  {Sex=female,                                                                     
      Fare.binned=(56.4958, Inf],                                                     
      SibSp.binned=(-Inf,0]}      => {Survived=1} 0.04040404  1.0000000 2.605263    36
[3]  {Sex=female,                                                                     
      Fare.binned=(56.4958, Inf],                                                     
      Parch.binned=(-Inf,0]}      => {Survived=1} 0.05387205  1.0000000 2.605263    48
[4]  {Pclass=2,                                                                       
      Sex=female,                                                                     
      Age.binned=(-Inf,19]}       => {Survived=1} 0.01795735  1.0000000 2.605263    16
[5]  {Pclass=1,                                                                       
      SibSp.binned=(-Inf,0],                                                          
      Age.binned=(-Inf,19]}       => {Survived=1} 0.01010101  1.0000000 2.605263     9
[6]  {Pclass=1,                                                                       
      Sex=female,                                                                     
      Age.binned=Missing}         => {Survived=1} 0.01010101  1.0000000 2.605263     9
[7]  {Pclass=1,                                                                       
      Sex=female,                                                                     
      Parch.binned=(-Inf,0]}      => {Survived=1} 0.07070707  0.9843750 2.564556    63
[8]  {Sex=female,                                                                     
      Fare.binned=(56.4958, Inf],                                                     
      Age.binned=(22, Inf]}       => {Survived=1} 0.05836139  0.9811321 2.556107    52
[9]  {Pclass=1,                                                                       
      Sex=female,                                                                     
      SibSp.binned=(-Inf,0]}      => {Survived=1} 0.05387205  0.9795918 2.552095    48
[10] {Pclass=1,                                                                       
      Sex=female,                                                                     
      Embarked=C}                 => {Survived=1} 0.04713805  0.9767442 2.544676    42
[11] {Pclass=1,                                                                       
      Sex=female,                                                                     
      Fare.binned=(56.4958, Inf]} => {Survived=1} 0.07856341  0.9722222 2.532895    70
[12] {Pclass=1,                                                                       
      Sex=female,                                                                     
      Age.binned=(22, Inf]}       => {Survived=1} 0.07070707  0.9692308 2.525101    63
[13] {Pclass=1,                                                                       
      Sex=female}                 => {Survived=1} 0.10213244  0.9680851 2.522116    91
[14] {Pclass=2,                                                                       
      Sex=female,                                                                     
      Parch.binned=(0, Inf]}      => {Survived=1} 0.03367003  0.9677419 2.521222    30
[15] {Pclass=1,                                                                       
      Sex=female,                                                                     
      Embarked=S}                 => {Survived=1} 0.05162738  0.9583333 2.496711    46
[16] {Pclass=1,                                                                       
      Sex=female,                                                                     
      Fare.binned=(10.5,56.4958]} => {Survived=1} 0.02356902  0.9545455 2.486842    21
[17] {Pclass=2,                                                                       
      Parch.binned=(0, Inf],                                                          
      Age.binned=(-Inf,19]}       => {Survived=1} 0.02244669  0.9523810 2.481203    20
[18] {Pclass=1,                                                                       
      Sex=female,                                                                     
      SibSp.binned=(0,1]}         => {Survived=1} 0.04264871  0.9500000 2.475000    38
[19] {Sex=female,                                                                     
      Embarked=C,                                                                     
      Age.binned=(22, Inf]}       => {Survived=1} 0.03928171  0.9459459 2.464438    35
[20] {Sex=female,                                                                     
      Fare.binned=(56.4958, Inf],                                                     
      SibSp.binned=(0,1]}         => {Survived=1} 0.03591470  0.9411765 2.452012    32

Adding the binned variables produced rules with a $confidence$ of 1.0 (a $lift$ of about 2.6).
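
The value of 2.6 follows directly from the definition of $lift$. The Fare table in the binning output shows 76 + 177 + 89 = 342 survivors among the 891 passengers, so for any rule with consequent Survived=1 and a $confidence$ of 1.0:

$$lift = \dfrac{P(B | A)}{P(B)} = \dfrac{1.0}{342/891} \approx 2.605$$

This is also the largest $lift$ that any rule with consequent Survived=1 can reach on this data, since $confidence$ cannot exceed 1.0.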

Bonus

Installing the arulesViz package enables richer ways of displaying rules.

Package installation

install.packages('arulesViz', dependencies=TRUE)
library(arulesViz)

Interactive inspect

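Lets you filter rules and change display settings interactively in a GUI.

inspectDT(subrules)

Graph-based visualization

Visualizes the rules as a graph.

plot(subrules, method='graph', engine='htmlwidget')

Many other display formats are available, but personally I find everything other than Interactive inspect hard to use.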
