
Converting CSV and TSV data to a matrix in Python, using MovieLens as an example

Last updated at 2016-02-27. Posted at 2016-02-26.

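I am just getting started with machine learning in Python.

Whatever algorithm you work with, you first have to convert sample data in CSV or TSV format into a matrix, so I looked into a few ways of doing that.

The sample data used here is the MovieLens 100K Dataset, the 100K variant of the MovieLens Dataset, which is said to be the most commonly used benchmark for collaborative filtering.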

MovieLens Dataset

For details about the dataset, see its README. The file you will mostly use is u.data.

It is a TSV file with four columns: user_id, item_id, rating, and timestamp.

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
...

The end goal is a matrix R in which the rating that user i gave to item j is stored as R(i,j) = rating.

Using the standard csv module

import csv

# u.data is tab-separated, so pass delimiter='\t'
with open('u.data', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)

If all you need is to read the file, the csv module alone is enough.
To decide the shape of the matrix R, however, you first need the maximum values of user_id and item_id.
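
As a minimal sketch of that step using only the csv module, the two maxima can be collected in a single pass over the file. The helper name max_ids and the in-memory SAMPLE string (the six rows shown earlier, standing in for u.data) are assumptions made for illustration and are not part of the dataset.

```python
import csv
import io

# The six sample rows shown earlier, standing in for the real u.data file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)


def max_ids(f):
    """Return (max user_id, max item_id) from a tab-separated ratings file."""
    max_user = max_item = 0
    for user_id, item_id, _rating, _timestamp in csv.reader(f, delimiter='\t'):
        # csv.reader yields strings, so convert before comparing
        max_user = max(max_user, int(user_id))
        max_item = max(max_item, int(item_id))
    return max_user, max_item


print(max_ids(io.StringIO(SAMPLE)))  # -> (298, 474)
```

With the real file, pass the handle from open('u.data', newline='') instead of the StringIO object.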

Handling CSV with pandas

pandas ("powerful Python data analysis toolkit") makes data handling much easier.

Install it with pip install pandas. The maximum of each column can then be obtained as follows.

>>> import pandas as pd
>>> df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

>>> df.max()
user_id            943
item_id           1682
rating               5
timestamp    893286638
dtype: int64

Here df is a DataFrame object, and df.max() returns a Series object.

>>> type(df)
<class 'pandas.core.frame.DataFrame'>

>>> type(df.max())
<class 'pandas.core.series.Series'>

To access the maximum of an individual column, do the following.

>>> df.max().loc['user_id']
943
>>> df.max().loc['item_id']
1682

For an explanation in Japanese, http://oceanmarine.sakura.ne.jp/sphinx/group/group_pandas.html is easy to follow.

Converting to the target matrix

From here, all that remains is to process the records one at a time.

import numpy as np
import pandas as pd

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

# One row per user and one column per item
shape = (df['user_id'].max(), df['item_id'].max())
R = np.zeros(shape)

# IDs start at 1, so subtract 1 to get 0-based indices
for row in df.itertuples(index=False):
    R[row.user_id - 1, row.item_id - 1] = row.rating


>>> print(R)
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
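
The record-by-record loop can probably be replaced by a single NumPy indexed assignment. The following is a sketch under the assumption that IDs are integers starting at 1. It reads the six sample rows from an in-memory string (SAMPLE, a stand-in for u.data that is not part of the original code) so that it runs without the dataset.

```python
import io

import numpy as np
import pandas as pd

# The six sample rows shown earlier, standing in for the real u.data file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

shape = (df['user_id'].max(), df['item_id'].max())
R = np.zeros(shape)

# Shift the 1-based IDs to 0-based indices and assign every rating at once.
rows = df['user_id'].to_numpy() - 1
cols = df['item_id'].to_numpy() - 1
R[rows, cols] = df['rating'].to_numpy()

print(R.shape, np.count_nonzero(R))
```

With the real data, replace io.StringIO(SAMPLE) with 'u.data'. The rest stays the same.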

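In general the result is a sparse matrix (out of the many movies available, any one person rates only a limited number), so using scipy's sparse module looks like the better choice.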

import pandas as pd
from scipy import sparse

df = pd.read_csv('u.data', sep='\t', names=['user_id','item_id', 'rating', 'timestamp'])

# One row per user and one column per item
shape = (df['user_id'].max(), df['item_id'].max())
R = sparse.lil_matrix(shape)

# IDs start at 1, so subtract 1 to get 0-based indices
for row in df.itertuples(index=False):
    R[row.user_id - 1, row.item_id - 1] = row.rating

>>> print(R.todense())
[[ 5.  3.  4. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]
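
scipy.sparse can also build the matrix directly from the three columns, without a Python loop. The sketch below uses sparse.coo_matrix with the (data, (row, col)) constructor form and then converts to CSR. It again assumes 1-based integer IDs and uses the in-memory SAMPLE string as a stand-in for u.data. One difference from the loop: if the same (user, item) pair appeared twice, the conversion to CSR would sum the ratings rather than overwrite them.

```python
import io

import pandas as pd
from scipy import sparse

# The six sample rows shown earlier, standing in for the real u.data file.
SAMPLE = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
    "298\t474\t4\t884182806\n"
)

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])

# Shift the 1-based IDs to 0-based row and column indices.
rows = df['user_id'].to_numpy() - 1
cols = df['item_id'].to_numpy() - 1
shape = (rows.max() + 1, cols.max() + 1)

# Build in COO form from (data, (row, col)), then convert to CSR,
# which supports element lookup and fast row slicing.
R = sparse.coo_matrix((df['rating'].to_numpy(), (rows, cols)), shape=shape).tocsr()

print(R.shape, R.nnz)
```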

That's all.

Correction

I found that the matrix had an extra first row and first column, and have fixed it.
The first draft was written as follows.

shape = (df.max().ix['user_id'] + 1, df.max().ix['item_id'] + 1)
R = np.zeros(shape) 

for i in df.index:
    row = df.ix[i]
    R[row['user_id'], row['item_id']] = row['rating']
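
To make the problem concrete, here is a small check, assuming (as in this dataset) that IDs start at 1. It fills one matrix the way the first draft did and one with the shifted indices, using the six sample rows from earlier. The names ratings, padded, and fixed are chosen only for this illustration.

```python
import numpy as np

# (user_id, item_id, rating) from the six sample rows shown earlier.
ratings = [
    (196, 242, 3),
    (186, 302, 3),
    (22, 377, 1),
    (244, 51, 2),
    (166, 346, 1),
    (298, 474, 4),
]

max_user = max(u for u, _, _ in ratings)
max_item = max(i for _, i, _ in ratings)

# First draft: shape padded by one, IDs used directly as indices.
padded = np.zeros((max_user + 1, max_item + 1))
for u, i, r in ratings:
    padded[u, i] = r

# Corrected version: exact shape, IDs shifted to 0-based indices.
fixed = np.zeros((max_user, max_item))
for u, i, r in ratings:
    fixed[u - 1, i - 1] = r

# No ID is 0, so row 0 and column 0 of the padded matrix stay empty.
print(padded[0].any(), padded[:, 0].any())       # -> False False
print(np.array_equal(padded[1:, 1:], fixed))     # -> True
```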
