LoginSignup
1
0

More than 3 years have passed since last update.

【Pandasサンプルコード】購入ログっぽいサンプルデータ作成&集計

Last updated at Posted at 2019-12-01

サンプルデータ作成

import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from math import ceil 

# create sample data ---------------
## function
def repeat_copy_and_random_choice(values_list, sample_length):
    repeat_rate = 5
    repeat_copied = values_list * ceil(sample_length / len(values_list) * repeat_rate)
    random_choiced = np.random.choice(repeat_copied, sample_length)
    return random_choiced

## parameters
data_length = 100
users = [c for c in 'ABCDEFGHIJ']
items = [c for c in 'abcdefghijklmnopqrstuvwxyz']
order_dates =  list(np.arange(datetime(2019,9,1), datetime(2019,9,30), timedelta(days=1)).astype(datetime))
item_price_master = pd.DataFrame()

## DataFrame
df = pd.DataFrame()
df['order_date'] = repeat_copy_and_random_choice(order_dates, data_length)
df['user'] = repeat_copy_and_random_choice(users, data_length)
df['item'] = repeat_copy_and_random_choice(items, data_length)
df['quantity'] = np.random.randint(1,10, data_length)
item_price_master['item'] = items
item_price_master['unit_price'] = [np.ceil(x * 1000) for x in np.random.random(len(items))]
df = df.merge(item_price_master, on='item', how='left')
df = df.sort_values(by=['order_date', 'user', 'item']).reset_index(drop=True)

こんなデータ
image.png
「購入日、ユーザー名、商品名、購入個数、単価」的なイメージ
※ランダム要素あるので、実行する度に変わる(シードは固定してない)

計算例1:ユーザーごとの 購入額合計/購入日数合計

## purchase amount / (date & user)
(df
.assign(price=lambda xdf: xdf['unit_price'] * xdf['quantity'])
.groupby('user')
.agg({
    'order_date':pd.Series.nunique,
    'price':pd.Series.sum,
    })
.assign(price_per_date=lambda xdf: (xdf['price'] / xdf['order_date']).astype(int))
.sort_values(by='price_per_date', ascending=False)
)

## purchase amount / (date & user):別の書き方(まどろっこしいが備忘)
def tmp1(srs):
    x = int(srs['price'] / srs['order_date'])
    return pd.Series(data=[srs['order_date'], srs['price'], x],
                    index=['order_date', 'price', 'price_per_date'])

(df
.assign(price=lambda xdf: xdf['unit_price'] * xdf['quantity'])
.groupby('user')
.agg({
    'order_date':pd.Series.nunique,
    'price':pd.Series.sum,
    })
.apply(tmp1, axis=1)
.sort_values(by='price_per_date', ascending=False)
)

結果はこんな感じ
image.png
※上のコードと下のコードで同じ計算してるが、表示フォーマットは若干変わるかも

計算例2:ユーザーごとの 購入間隔日数の中央値

def calc_med_diff_date(xdf):
    # 日付が異なるところだけ残す
    flags = (
        (xdf['order_date_prev'].notnull()) &
        (xdf['order_date']!=xdf['order_date_prev'])
    )
    tmp = xdf.loc[flags, :]
    avg_diff_date = (tmp['order_date'] -  tmp['order_date_prev']).median()
    return avg_diff_date

(df
.sort_values(by=['user', 'order_date'], ascending=True)
.assign(order_date_prev=lambda xdf: xdf.groupby('user')['order_date'].shift(1))
[['user','order_date','order_date_prev']]  # 別にこれはなくてもいい
.groupby('user')
.apply(calc_med_diff_date)
.sort_values(ascending=True)
)

結果はこんな感じ
image.png

おわり

1
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
0