Preparing transaction data for association analysis with MySQL's group_concat

Posted at 2022-08-18

Exactly what the title says. Searching did not turn up anything on this, so I am writing it down as a memo.

Environment

Component	Version
MacBook Pro	macOS Catalina 10.15.7
MySQL	8.0.23
Sequel Pro	Nightly build 5446
R	4.0.2
RStudio	1.3.959
arules	1.7-3

Check 1

The sample data is borrowed from here.

First, create a simple order-detail table.

create table order_test
(order_id char(5),
item varchar(10)
);

insert into order_test values
(111,"ビール"),
(111,"おむつ"),
(111,"ドレッシング"),
(111,"生野菜"),
(222,"ビール"),
(222,"柿ピー"),
(222,"ソーセージ"),
(333,"ソーダ"),
(333,"ポテトチップス"),
(333,"チョコレート"),
(333,"アイス"),
(444,"ビール"),
(444,"柿ピー"),
(444,"ソーセージ"),
(555,"生野菜"),
(555,"果物"),
(555,"豆腐"),
(555,"豚肉"),
(666,"生野菜"),
(666,"果物"),
(666,"豆腐"),
(666,"牛肉"),
(666,"じゃがいも"),
(777,"根菜"),
(777,"豆腐"),
(777,"牛肉"),
(777,"こんにゃく"),
(888,"ワイン"),
(888,"ソーセージ"),
(888,"チーズ"),
(999,"ワイン"),
(999,"牛肉"),
(999,"じゃがいも"),
(999,"じゃがいも");

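Aggregating with group_concat() as shown below produces one comma-separated string per order.
Adding distinct removes duplicate items within an order. (They can also be dropped on the R side.)
Save the result as a csv or txt file.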

select group_concat(distinct item order by item asc) as "tran"
from order_test
group by order_id;
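
For readers less familiar with group_concat, the same grouping can be mimicked outside the database. This is only a minimal Python sketch of the idea, not how MySQL implements it. The helper name `basket_lines` is made up for illustration, and Python sorts strings by code point, which may not match the ordering of your MySQL collation.

```python
from collections import defaultdict

def basket_lines(rows):
    """Mimic: select group_concat(distinct item order by item) ... group by order_id."""
    groups = defaultdict(set)  # a set drops duplicate items per order (distinct)
    for order_id, item in rows:
        groups[order_id].add(item)
    # One comma-separated line per order; items sorted by code point (assumption).
    return {oid: ",".join(sorted(items)) for oid, items in groups.items()}

# A subset of the rows inserted above, including the duplicated item in order 999.
rows = [
    ("222", "ビール"), ("222", "柿ピー"), ("222", "ソーセージ"),
    ("999", "ワイン"), ("999", "牛肉"), ("999", "じゃがいも"), ("999", "じゃがいも"),
]
for order_id, line in sorted(basket_lines(rows).items()):
    print(order_id, line)
```

For these two orders the sketch prints one line each, with the duplicate in order 999 collapsed.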

test_posdata.csv

おむつ,ドレッシング,ビール,生野菜
ソーセージ,ビール,柿ピー
アイス,ソーダ,チョコレート,ポテトチップス
ソーセージ,ビール,柿ピー
果物,生野菜,豆腐,豚肉
じゃがいも,果物,牛肉,生野菜,豆腐
こんにゃく,根菜,牛肉,豆腐
ソーセージ,チーズ,ワイン
じゃがいも,ワイン,牛肉

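One thing to watch when saving the data: if the file contains a header row, or if the fields are wrapped in double quotes, the results will change.

"transaction" # the header gets treated as an item
"おむつ,ドレッシング,ビール,生野菜" # everything inside "" gets treated as a single item
...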
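
The quoting pitfall can be reproduced with any CSV parser that honors double quotes. The sketch below uses Python's csv module as a stand-in for the reader. It is an analogy under that assumption, not a description of how read.transactions parses files internally, and `parse_basket` is a hypothetical helper name.

```python
import csv
import io

def parse_basket(text):
    """Parse basket-style text: one transaction per line, comma-separated items."""
    return [row for row in csv.reader(io.StringIO(text)) if row]

plain = "おむつ,ドレッシング,ビール,生野菜\n"
quoted = '"おむつ,ドレッシング,ビール,生野菜"\n'

print(len(parse_basket(plain)[0]))   # unquoted line: four separate items
print(len(parse_basket(quoted)[0]))  # quoted line: the whole line is one item
```

So when exporting from the SQL client, it is worth checking that both the header and field quoting are switched off.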

After that, load the saved csv or txt file in R and run the analysis. I will not go into the details of association analysis here.

> library(arules)
> sample1.tran <- read.transactions(file="test_posdata.csv",format="basket",sep=",",rm.duplicates=TRUE)
> sample1.ap <- apriori(sample1.tran,parameter=list(support=0.01))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 0 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[19 item(s), 9 transaction(s)] done [0.00s].
sorting and recoding items ... [19 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [138 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

> inspect(head(sort(sample1.ap,by="support"),n=10))
     lhs                     rhs          support   confidence coverage  lift count
[1]  {柿ピー}             => {ソーセージ} 0.2222222 1          0.2222222 3.0  2    
[2]  {柿ピー}             => {ビール}     0.2222222 1          0.2222222 3.0  2    
[3]  {じゃがいも}         => {牛肉}       0.2222222 1          0.2222222 3.0  2    
[4]  {果物}               => {豆腐}       0.2222222 1          0.2222222 3.0  2    
[5]  {果物}               => {生野菜}     0.2222222 1          0.2222222 3.0  2    
[6]  {ソーセージ, 柿ピー} => {ビール}     0.2222222 1          0.2222222 3.0  2    
[7]  {ビール, 柿ピー}     => {ソーセージ} 0.2222222 1          0.2222222 3.0  2    
[8]  {ソーセージ, ビール} => {柿ピー}     0.2222222 1          0.2222222 4.5  2    
[9]  {果物, 豆腐}         => {生野菜}     0.2222222 1          0.2222222 3.0  2    
[10] {果物, 生野菜}       => {豆腐}       0.2222222 1          0.2222222 3.0  2

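The defaults for apriori are support=0.1, confidence=0.8, and maxlen=10 (the parameter specification in the output above shows the values in effect), so depending on the data no rules may be returned at all. In that case, adjust the thresholds through parameter.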
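
To get a feel for what the support and confidence thresholds filter on, the measures can be recomputed by hand from the nine transactions in test_posdata.csv. This is a plain counting sketch using the standard definitions, not the Apriori algorithm itself, and `rule_metrics` is a name I made up for it.

```python
transactions = [
    {"おむつ", "ドレッシング", "ビール", "生野菜"},
    {"ソーセージ", "ビール", "柿ピー"},
    {"アイス", "ソーダ", "チョコレート", "ポテトチップス"},
    {"ソーセージ", "ビール", "柿ピー"},
    {"果物", "生野菜", "豆腐", "豚肉"},
    {"じゃがいも", "果物", "牛肉", "生野菜", "豆腐"},
    {"こんにゃく", "根菜", "牛肉", "豆腐"},
    {"ソーセージ", "チーズ", "ワイン"},
    {"じゃがいも", "ワイン", "牛肉"},
]

def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence, lift) for the rule lhs => rhs.

    Assumes lhs and rhs each occur in at least one transaction.
    """
    n = len(transactions)
    n_lhs = sum(lhs <= t for t in transactions)           # transactions containing lhs
    n_rhs = sum(rhs <= t for t in transactions)           # transactions containing rhs
    n_both = sum((lhs | rhs) <= t for t in transactions)  # transactions containing both
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift

print(rule_metrics({"ソーセージ", "ビール"}, {"柿ピー"}, transactions))
```

For this rule the values come out at roughly 0.222, 1 and 4.5, which agrees with row [8] of the inspect() output.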

Check 2

Let's also try this on a somewhat larger dataset.
The dataset is borrowed from here.

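With the default settings, the result of group_concat is truncated at group_concat_max_len, which defaults to 1024 bytes. When the concatenated item names may get longer than that, raise the setting beforehand.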

show variables like 'group_concat_max_len';
set  group_concat_max_len = 100000;
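
To decide how large the value needs to be, one option is to measure the longest concatenated string the data would produce. The sketch below does this in Python on rows already loaded into memory. `longest_concat_bytes` and the sample rows are hypothetical, and it counts UTF-8 bytes on the assumption that the table uses a UTF-8 character set.

```python
from collections import defaultdict

DEFAULT_MAX_LEN = 1024  # MySQL's default group_concat_max_len, in bytes

def longest_concat_bytes(rows, sep=","):
    """Largest UTF-8 byte length of any per-group concatenation of distinct items.

    Assumes rows is non-empty.
    """
    groups = defaultdict(set)
    for key, item in rows:
        groups[key].add(item)
    return max(len(sep.join(sorted(items)).encode("utf-8")) for items in groups.values())

# Hypothetical rows: a single invoice with 60 distinct product names.
rows = [("536365", f"PRODUCT NAME NUMBER {i:03d} ........") for i in range(60)]
needed = longest_concat_bytes(rows)
print(needed, needed > DEFAULT_MAX_LEN)
```

If the measured length exceeds the current setting, rows for large invoices would be cut off partway through the item list.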

Prepare the table, then download data.csv and import it. (As far as I remember, Kaggle does not allow downloads without an account, so register first.)

CREATE TABLE `ECdata` (
  `InvoiceNo` varchar(10),
  `Description` varchar(100)
);

As before, put the key ID in group by and join the product names with group_concat.
Save the result as a csv or similar file.

select group_concat(distinct Description order by Description asc) as "tran"
from ECdata
group by InvoiceNo;
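
One caveat with free-text product names: the basket format splits on the separator, so a name that itself contains a comma would be read as two items. I have not checked whether this dataset has such names, so the sketch below is just a precaution. `items_containing` and the sample rows are made up for illustration.

```python
def items_containing(rows, sep=","):
    """Return the distinct item names that contain the separator character."""
    return sorted({item for _, item in rows if sep in item})

# Hypothetical rows; these names are not taken from the dataset.
rows = [
    ("A001", "RED MUG"),
    ("A001", "SIGN, METAL"),
    ("A002", "BLUE MUG"),
]
print(items_containing(rows))
```

If anything turns up, group_concat accepts a SEPARATOR clause, so a character that does not occur in the data could be used there and passed as sep to read.transactions.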

> sample2.tran <- read.transactions(file="test_posdata.csv",format="basket",sep=",",rm.duplicates=TRUE)
> sample2.ap <- apriori(sample2.tran,parameter=list(support=0.02,confidence=0.4))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.4    0.1    1 none FALSE            TRUE       5    0.02      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 331 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[8978 item(s), 16557 transaction(s)] done [0.13s].
sorting and recoding items ... [150 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [23 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].
> inspect(head(sort(sample2.ap,by="support"),n=10))
     lhs                                         rhs                                      support    confidence coverage   lift      count
[1]  {ALARM  CLOCK  BAKELIKE  GREEN}          => {ALARM  CLOCK  BAKELIKE  RED}            0.03207103 0.6467722  0.04958628 11.964925 531  
[2]  {ALARM  CLOCK  BAKELIKE  RED}            => {ALARM  CLOCK  BAKELIKE  GREEN}          0.03207103 0.5932961  0.05405569 11.964925 531  
[3]  {JUMBO  BAG  PINK  POLKADOT}             => {JUMBO  BAG  RED  RETROSPOT}             0.03201063 0.6471306  0.04946548  8.007879 530  
[4]  {JUMBO  STORAGE  BAG  SUKI}              => {JUMBO  BAG  RED  RETROSPOT}             0.02633327 0.6030429  0.04366733  7.462318 436  
[5]  {JUMBO  SHOPPER  VINTAGE  RED  PAISLEY}  => {JUMBO  BAG  RED  RETROSPOT}             0.02572930 0.5539662  0.04644561  6.855021 426  
[6]  {LUNCH  BAG    BLACK  SKULL.}            => {LUNCH  BAG  RED  RETROSPOT}             0.02325300 0.4588796  0.05067343  7.955675 385  
[7]  {LUNCH  BAG  RED  RETROSPOT}             => {LUNCH  BAG    BLACK  SKULL.}            0.02325300 0.4031414  0.05767953  7.955675 385  
[8]  {GARDENERS  KNEELING  PAD  CUP  OF  TEA} => {GARDENERS  KNEELING  PAD  KEEP  CALM}   0.02289062 0.7044610  0.03249381 17.834496 379  
[9]  {GARDENERS  KNEELING  PAD  KEEP  CALM}   => {GARDENERS  KNEELING  PAD  CUP  OF  TEA} 0.02289062 0.5795107  0.03949991 17.834496 379  
[10] {ALARM  CLOCK  BAKELIKE  PINK}           => {ALARM  CLOCK  BAKELIKE  RED}            0.02289062 0.6192810  0.03696322 11.456353 379
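
The "Absolute minimum support count" lines in both runs are consistent with multiplying the support threshold by the number of transactions and truncating to an integer. A small sketch of that arithmetic follows. The truncation rule is inferred from the two logs above rather than taken from the arules source, and `absolute_min_support` is a hypothetical name.

```python
def absolute_min_support(support, n_transactions):
    """Relative support threshold expressed as a transaction count (truncated)."""
    return int(support * n_transactions)

print(absolute_min_support(0.1, 9))       # first run: 9 transactions
print(absolute_min_support(0.02, 16557))  # second run: 16557 transactions
```

With a count of 0, as in the first run, every itemset that appears at all passes the support filter, which helps explain why so many rules come back for such a small file.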
