More than 3 years have passed since last update.

DeNA Advent Calendar 2021

@yaiwase(岩瀬靖彦)in

株式会社ディー・エヌ・エー

SQLのみで協調フィルタリング

Last updated at 2022-01-20Posted at 2021-12-12

この記事は、DeNA Advent Calendar 2021 13日目の記事です

DeNA スポーツ事業本部でエンジニアをしている @yaiwase です、よろしくお願いします
最近は先日リリースされたデジタルムービーコレクションサービス PLAYBACK9 を担当しています

概要

次のようなよくあるデータ構造をみて「商品Aを買ったユーザにおすすめできる商品B」を知りたくなったとき、 SQL で手軽に取得する方法について書きます

create table `users` (user_id bigint);
create table `items` (item_id bigint);
create table `sales` (user_id bigint, item_id bigint);

アルゴリズム

アイテムベース協調フィルタリング

取得したい値を式で表現すると、次の通りです


similarity（i,j） = \frac{\sum_{u}^{U} r(u,i)r(u,j)}{  \sqrt{\sum_{u}^{U} r^2(u,i) }  \sqrt{\sum_{u}^{U} r^2(u,j) }  }

\quad \\
\quad \\

U: 顧客全て \quad\quad\quad\quad\quad　\\
u: 顧客　\quad\quad\quad\quad\quad　\\
r(u,i): 顧客u による商品i の評価を返す関数　\\
i,j: 商品 \quad\quad\quad\quad\quad\quad　\\

「顧客ごとの商品に対する評価」をデータに含んでいれば上記式ですが、今回例にするような評価をもたないデータ構造の場合、よりシンプルに「購入した(1)」か「購入していない(0)」か、だけで評価します

一致しているところだけみるので、購入顧客群のコサイン類似度でよく、次の形になります

similarity（i,j） = \frac{\vec{I}\bullet\vec{J}}{  \| \boldsymbol{I}  \| *  \| \boldsymbol{J} \| } \\

\quad \\

\vec{I} : \{ u_{2},u_{4},u_{6}... \} \quad 商品i を購入した顧客群 \\
\vec{J} : \{ u_{3},u_{6},u_{9}... \} \quad 商品j を購入した顧客群 \\

式からは、商品iと商品jの組み合わせについて、

ともに購入した顧客群（積）の大きさ
どちらかを購入した顧客群（和）の大きさ

の２つをとり、割り算でスコアを求められることがわかります

実行環境

MariaDB 10.6
- SQL しか使わないので、ほかの要件はありません

データ

このくらいのデータ量とします

顧客数	商品数	明細数
10,000	10,000	100,000

組み合わせの数を対象とするので、顧客数・商品数が大きくなりすぎるときは、対象データを絞り込む（商品カテゴリや期間、購入実績などで絞る）とよさそうです
ノイズになるデータがありうるので、データに知見があれば先に削るといいと思います

所要時間

10分後には、レコメンドリストとスコアを手にしてニコニコできるでしょう（データ量とDB性能により前後しますmm）

集計手順

商品組み合わせごとの共通購入者（積集合）をだす
商品組み合わせごとの総購入者（和集合）をだす
商品ごとの類似度をだす

顧客群の積

商品組み合わせごとの共通購入者数
- 同じ組み合わせは除外する
  - （ID1×ID2 と ID2×ID1 は同じ組み合わせなのでどちらかは捨ててよい）
  - target側が小さいIDの組み合わせのみ残し、逆の組み合わせは捨てる

create table both_items_purchasers (
  target_item_id  bigint,
  counter_item_id bigint,
  both_purchasers bigint,
  primary key (target_item_id, counter_item_id)
);

insert into both_items_purchasers
select
  target_item_id,
  counter_item_id,
  both_purchasers
from
(
  select
    target_item_id,
    counter_item_id,
    count(distinct target_user_id) both_purchasers
  from
  (
    select
      case
        when target.item_id < counter.item_id then target.item_id
        else counter.item_id
      end target_item_id,
      case 
        when target.item_id < counter.item_id then counter.item_id
        else target.item_id
      end counter_item_id,
      target.user_id target_user_id,
      counter.user_id counter_user_id
    from 
    ( select user_id, item_id from sales ) target 
    join 
    ( select user_id, item_id from sales ) counter 
      on ( 
        target.user_id = counter.user_id 
        and 
        target.item_id != counter.item_id
      )
  ) a
  group by 
    target_item_id,
    counter_item_id
) a2;

select
  count(distinct target_item_id),
  count(distinct counter_item_id),
  count(*)
from
  both_items_purchasers;

対象商品数	組み合わせ商品数	組み合わせ件数
10,000	10,000	496,856

顧客群の和

商品組み合わせごとの総購入者数

create table all_purchasers (
  target_item_id  bigint,
  counter_item_id bigint,
  all_purchasers  bigint,
  primary key (target_item_id, counter_item_id)
);

insert into all_purchasers
select
   target_item_id,
   counter_item_id,
   count(distinct j.user_id) all_purchasers
from
(
  select
     target_item_id,
     counter_item_id,
     concat_ws(',', target.purchasers, counter.purchasers) array_purchasers
  from 
  (
    select target_item_id, counter_item_id from both_items_purchasers
  ) b
  join 
  (
    select item_id, group_concat(user_id) purchasers
    from sales
    group by item_id
  ) target
    on b.target_item_id = target.item_id
  join 
  (
    select item_id, group_concat(user_id) purchasers
    from sales
    group by item_id
  ) counter
    on b.counter_item_id = counter.item_id 
) a
join 
json_table(replace(json_array(a.array_purchasers), ',', '","'),'$[*]' columns (user_id varchar(10) path '$')) j
group by 
   target_item_id,
   counter_item_id
;

select
  count(distinct target_item_id),
  count(distinct counter_item_id), 
  count(*)
from
  all_purchasers;

対象商品数	組み合わせ商品数	組み合わせ件数
10,000	10,000	496,856

類似度スコア計算

商品組み合わせごと、共通購入者（積集合）数 / 両商品の総購入者（和集合）数

create table item_similarity_score (
  target_item_id  bigint,
  counter_item_id bigint,
  both_purchasers bigint,
  all_purchasers  bigint,
  similarity_score double
);

insert into item_similarity_score
select
  b.target_item_id,
  b.counter_item_id,
  both_purchasers,
  all_purchasers,
  ( both_purchasers / all_purchasers ) similarity_score
from 
(
  select
    target_item_id,
    counter_item_id,
    both_purchasers
  from
    both_items_purchasers
) b
join
(
  select
    target_item_id,
    counter_item_id,
    all_purchasers
  from
    all_purchasers
) a
  on ( 
    b.target_item_id = a.target_item_id 
    and 
    b.counter_item_id = a.counter_item_id 
  )
;

商品組み合わせごとの類似度スコアが計算できました

select target_item_id,counter_item_id,similarity_score from item_similarity_score;
+----------------+-----------------+------------------+
| target_item_id | counter_item_id | similarity_score |
+----------------+-----------------+------------------+
|              1 |               2 |             0.56 |
|              1 |               3 |       0.29347826 |
|              1 |               4 |      0.538461538 |
|              1 |               5 |              0.5 |
...

調整

最終的なスコア分布に偏りがなければ良いのですが、データによっては分布に偏りができます
（組み合わせ数に対して十分な取引データがなければ、スコアは全体的に低めに出そうです）

0-1 間で適度にばらけた方が読みやすいスコアになるので、用途により調整するといいかもです

select
  round(similarity_score,1),
  count(*) 
from item_similarity_score 
group by round(similarity_score,1);

+---------------------------+----------+
| round(similarity_score,1) | count(*) |
+---------------------------+----------+
|                       0.0 |   294598 |
|                       0.1 |   191941 |
|                       0.2 |     9012 |
|                       0.3 |      203 |
|                       0.4 |      101 |
|                       0.5 |        1 |
+---------------------------+----------+

正規分布になるよう調整する

create table `item_similarity_normalized_score` (
  `target_item_id`   bigint,
  `counter_item_id`  bigint,
  `similarity_score` double,
  primary key (target_item_id, counter_item_id)
);

insert into item_similarity_normalized_score
select
  target_item_id, 
  counter_item_id,
  case 
    when score > 1 then 1
    when score < 0 then 0
    else score
  end score
from
(
  select
    target_item_id, 
    counter_item_id,
    ((( similarity_score - avg_score ) / ( 10 * stddev_score )) + 0.5 ) score
  from
  (
    select 
      'A' dummy,
      target_item_id, 
      counter_item_id, 
      similarity_score
    from 
      item_similarity_score
  ) a
  join
  (
    select 
      'A' dummy,
      avg(similarity_score) avg_score,
      std(similarity_score) stddev_score
    from
      item_similarity_score
  ) b
    on a.dummy = b.dummy
) s;

スコア分布、適度にばらけました

select
 round(similarity_score,1),count(*)
from item_similarity_normalized_score
group by round(similarity_score,1);

+------------------+--------+
| similarity_score | cnt    |
+------------------+--------+
|              0.2 |    548 |
|              0.3 |  27282 |
|              0.4 | 134132 |
|              0.5 | 190714 |
|              0.6 | 110738 |
|              0.7 |  31571 |
|              0.8 |   4916 |
|              0.9 |    506 |
|              1.0 |     34 |
+------------------+--------+

評価

スコアが妥当かどうか、実際のデータをみて確認します（定性評価）

例として PLAYBACK9 の商品データで試してみると
現在までの販売商品のうち、最も類似度が高いのは商品2 と商品4 でした

select 
  target_item_id,
  counter_item_id,
  similarity_score 
from item_similarity_score 
order by similarity_score desc;

+----------------+-----------------+------------------+
| target_item_id | counter_item_id | similarity_score |
+----------------+-----------------+------------------+
|              2 |               4 |      0.837837837 |
...

該当商品は同価格帯の同じ選手の活躍シーンになっていて、ファンが特定選手の商品を選んで、複数種購入していることが窺えます

スコア高くなっている商品同士を確認し、それぞれ似ていることが確認できました

まとめ

SQLで手軽に「似ている商品」を調べる方法について書きました

同じデータ構造ならすぐ出せるので、お手元のデータでもぜひ試してみてください

余談

SQLのみで実験データ生成

実データに自由にクエリを投げれればいいのですが、お試しで処理時間とか測りたいときはダミーデータが欲しくなると思います

次のようなプロシージャで好きなだけダミー生成できるので、余談として書き残しておきます


SET @item_count=10000;
SET @data_count=1000000;

DELIMITER $$
CREATE PROCEDURE generate_sales()
BEGIN
  DECLARE i INT DEFAULT 1;
  WHILE i <= @data_count DO
    INSERT INTO `sales` (`user_id`,`item_id`) 
    VALUES (
      cast(ROUND(RAND() * @item_count,0) as bigint),
      cast(ROUND(RAND() * @item_count,0) as bigint)
    );
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;

CALL generate_sales();

DeNAより

DeNAでは今年 2021年度新卒・2022年度新卒内定エンジニアの Advent Calendar もあります！
本 Advent Calendar とは違った種類、違った視点での記事をぜひお楽しみください！

▼DeNA 2021年度新卒エンジニア・2022年度新卒内定エンジニアによる Advent Calendar 2021
https://qiita.com/advent-calendar/2021/dena-21x22

また DeNA 公式 Twitter アカウント @DeNAxTech では、 Blog記事だけでなく様々な登壇の資料や動画も発信してます。ぜひフォローして下さい！
Follow @DeNAxTech

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up