More than 5 years have passed since last update.

疎行列を使った簡単な実用規模レコメンデーションの実装

Posted at 2019-01-07

目的

１０００万件程度の購買履歴情報から正値行列分解方式で顧客、商品の嗜好ベクトルを生成する簡易なレコメンデーションシステムの実装にあたり、いくつかの障害に遭遇したのでメモ代わりの記録に残すこととした。

疎行列化に伴う課題

今日、scikit-learnやTensorFlow をpandas とともに用いれば極めて容易に協調フィルタリング方式のレコメンデーションシステムを作製することが出来る。ただし、データ量の増加とともに　「顧客ｘ商品」などの関係マトリックスは巨大化するため、小規模システムで高速なdense な行列の実装では実用的システムでは稼働しなくなってしまう。特に

取引データから「顧客vs商品」などのPivot集計
集計を元にした NMF による分解行列の生成

前者は pandas のpivot が疎行列に対応していない問題で修正要望も出ているようだが、現時点では対応されてないようだ。また、NMFに関しては、scikit-learnのNMFモジュール自体はドキュメント上もsparse対応しているようだが筆者の用いたデータサイズの場合、途中で行われるempty 処理が巨大なdense行列を初期化しようとしてうまく行かなかった。使い方にコツがあるのかも知れない。

集計による疎行列の作製

そこで、まず集計の方だが、stack overflowのこの記事を参考にして　scipyのcsr_matrixの初期化機能csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])を用いることでpivot集計が出来ることが判った。コードとしては次のようなものとなる。index_labe col_label　がpivot集計したいカラムであり、valueが集計対象となる値。入力のtableはpandasのDataFrameである。

import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

def SparsePivot(table,index_label,col_label,val_label):
    index_ca = CategoricalDtype(sorted(table.loc[:,index_label].unique()),ordered=True)
    col_ca = CategoricalDtype(sorted(table.loc[:,col_label].unique()),ordered=True)
    row = table.loc[:,index_label].astype(index_ca).cat.codes
    col = table.loc[:,col_label].astype(col_ca).cat.codes
    sparse_matrix = csr_matrix((table[val_label],(row,col)),
                                shape=(index_ca.categories.size,col_ca.categories.size))
    dfs = pd.SparseDataFrame(sparse_matrix,index=index_ca.categories,columns=col_ca.categories,default_fill_value=0)
    return(dfs)

これで分解の対象となる正規化前のマトリックスが出来る。１０００万件程度のＤａｔａＦｒａｍｅをtableとして渡しても短時間に集計してくれる。

疎行列の正値行列分解

つぎは正値行列分解。前述のように本来、scikit-learn のNMFを使えるはずなのでまずはそちらを試して欲しい。私の場合は、上記の通りいくつかの初期化、分解方式を試したが作業用のパソコン上で50万x10万　程度の疎行列に対してはメモリー不足で動作（３２Ｇ）しなかった。
そこで、以下のようにアルゴリズムの段階からNMF分解を書き起こした。デバグコードが入っているのはご容赦を。update_uvが実質てきな処理で、解析対象のデータに会わせて特徴ベクトルの次元などは決め打ちしてある。

import numpy as np
def update_uv(M, U, V):
    divu = np.dot(U,np.dot(V.T, V))
    divu = np.where(divu == 0,0.1,divu)
    U = U * M.dot(V) / divu
    divv = np.dot(np.dot(U.T, U), V.T).T
    divv = np.where(divv == 0,0.1,divv)
    V = V * (M.T).dot(U) / divv
    return U, V

def sparseNMF(sparse_matrix):
    k = 16
    m = sparse_matrix.shape[0]
    n = sparse_matrix.shape[1]
    U = np.abs(np.random.uniform(low=0,high=1,size=(m,k)))
    V = np.abs(np.random.uniform(low=0,high=1,size=(n,k)))
    iters = 10000
    minU = 1000
    ng_max = 100
    ng_count = 0
    for _ in range(iters):
        print('count = '+str(_))
        lastU = U
        lastV = V
        U, V = update_uv(sparse_matrix, U, V)
        lastU = U - lastU
        diffU = np.linalg.norm(lastU)
        nrmU = np.linalg.norm(U)

        lastV = V - lastV
        diffV = np.linalg.norm(lastV)
        nrmV = np.linalg.norm(V)
        diffall = diffU*diffV
        print ('result norm U is '+str(nrmU)+' and diff norm is '+str(diffU))
        print ('result norm V is '+str(nrmV)+' and diff norm is '+str(diffV))
        print ('diffall = '+str(diffall))
        if diffall<minU:
            minU = diffall
            optU = U
            ng_count=0
        else:
            ng_count+=1
            print ('ng_count = '+str(ng_count))
            if ng_count>ng_max:
                break
    U = normalize(U,axis=1)
    V = normalize(V,axis=1)
    return(U,V)

データサイエンスの時代と言われているがPython などを用いることで機械学習をシステムに取り込む敷居はかなり下がっている。ただし、実用的なデータ規模に適用する場合の障害はまだまだ多いようだ。同様な課題で悩んでいる方がいるのであれば今後も情報の共有を行っていきたい。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up