
A Python beginner tried demand forecasting for an e-commerce site using Kaggle!

Last updated at 2022-06-02
Posted at 2022-06-02

https://www.kaggle.com/code/teraodaiyou/notebook740f191599
This is the Kaggle notebook used in this post.

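1. Table of contents

Ⅰ. Introduction
Ⅱ. Execution environment
Ⅲ. What the analysis covers
Ⅳ. Terminology
Ⅴ. Workflow up to the sales forecast
Ⅵ. The code in detail
Ⅶ. Discussion
Ⅷ. Conclusion
Ⅸ. References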

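2. Introduction

I work for an industrial machinery manufacturer. Because I hear words such as "AI", "DX", and "big data" more and more often, I expected to work with data more at my job and decided to enroll at AIdemy.
I started the three-month data analysis course in mid-March 2022, and a little over two months have passed since then.

I had never programmed before and knew little about computers when I started, but thanks to the precise guidance and advice of the AIdemy instructors I managed to reach the final project.
The curriculum begins with an introduction to Python. I could follow the text through NumPy and Pandas, but data cleansing and data handling for machine learning, the core of data analysis, and then deep learning were very hard for me to understand.

In machine learning in particular, data collection and preprocessing are said to take 70 to 80 percent of the total working time, which makes them very important. The course made me realize how much data shaping and cleansing matter.
Building on that experience, and with practical work in mind, I take on a data analysis task: demand forecasting for an e-commerce site.

For people involved in manufacturing, like me, a demand forecast is a major indicator for decisions on inventory levels, material purchasing, and staffing. It should also be useful to people who plan store openings, store development, sales, and marketing. I hope this post helps those who are thinking of starting to learn programming.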

3. Execution environment

PC: MacBook Pro
Environment: Kaggle notebook
Python version: 3.8.8

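4. What the analysis covers

This post uses "Predict Future Sales", a competition on Kaggle, the data analysis platform where data scientists from around the world compete. Using per-shop, per-item sales data provided by a Russian software company, I predicted the number of items sold in the following month. This is demand forecasting, one of the areas where AI is being adopted most actively.
The benefits of demand forecasting include more efficient ordering, better inventory levels, and predictions of customer traffic, so I think people in many industries can easily picture its business use.

Competitions such as those on Kaggle often use gradient boosting libraries such as XGBoost and LightGBM (LGBM).
Gradient boosting has the characteristics listed below and appears to be the mainstream choice of model for data analysis.
This time I used XGBoost as the model.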

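What is XGBoost?

XGBoost (eXtreme Gradient Boosting) is a supervised learning method.
It is used very often on Kaggle and also in practical work.
It is easy to implement and can achieve very high accuracy.

XGBoost combines decision trees with boosting, one form of ensemble learning.
A decision tree, as the name suggests, classifies data with a tree structure. It is used in practice because it gives reasonable accuracy and its results are easy to inspect. A single decision tree is not very accurate, but combining decision trees through ensemble learning produces a method with very high accuracy.

Put simply, ensemble learning builds several models and combines them in various ways.
The three main families are bagging, boosting, and stacking; XGBoost uses boosting.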

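With boosting, XGBoost builds multiple decision trees in sequence and improves accuracy step by step.
Each new tree focuses on the parts that the previous trees could not predict well.
Elements that a single tree cannot separate become separable when several trees are combined in sequence.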
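
The sequential idea described above can be sketched in a few lines. The code below is a toy illustration, not XGBoost itself: it assumes a single numeric feature, squared error, and one-split trees ("stumps"), and all function names and data are made up for the example. Each round fits a stump to the residuals left by the previous rounds.

```python
# Toy gradient boosting for squared error with one-split trees on one feature.
# This only illustrates the idea; XGBoost adds regularisation and much more.

def fit_stump(xs, residuals):
    """Pick the threshold that minimises squared error of a two-leaf tree."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every candidate leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left value, right value)


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_trees=20, learning_rate=0.3):
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    history = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)                # next tree targets the residuals
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return preds, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0, 9.0, 9.0]
preds, history = boost(xs, ys)
print(round(history[0], 3), round(history[-1], 3))  # training error before and after
```

The training error never increases from one round to the next here, which is the "each tree corrects the previous ones" behaviour in miniature.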

Advantages of gradient boosting

・No need to impute missing values
・Redundant features cause little harm
・Unlike a random forest, the trees are built in sequence.

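5. Terminology

What is preprocessing?

Preprocessing means cleaning and transforming data into a usable form (features) for the task at hand, such as machine learning. Collected data can almost never be used as it is. Unifying inconsistent date formats, unifying spelling variants such as 「ねこ・ネコ・猫」 (three ways of writing "cat"), and in general cleaning the data before it is handed to a program so that the program can interpret it correctly: that is data preprocessing.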
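
As a small illustration of the two examples above, the sketch below unifies spelling variants and date formats. The mapping table, the list of date formats, and the function names are hypothetical and chosen only for this example.

```python
from datetime import datetime

# Hypothetical mapping from spelling variants to one canonical label.
CANONICAL = {"ねこ": "猫", "ネコ": "猫", "猫": "猫"}
# Hypothetical list of date formats that appear in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")


def normalize_label(text):
    """Strip whitespace and map known variants to the canonical spelling."""
    text = text.strip()
    return CANONICAL.get(text, text)


def normalize_date(text):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


raw = [("ねこ", "2013/01/02"), ("ネコ ", "02.01.2013"), ("猫", "2013-01-02")]
clean = [(normalize_label(label), normalize_date(date)) for label, date in raw]
print(clean)
```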

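What are features?

Features are the input data used to build a machine learning prediction model. Not every column in the collected data is needed as input. Columns are selected, and missing ones are added, as the model is tuned. The columns that finally serve as input values are called features. Creating and selecting features is called feature engineering.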

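6. Workflow up to the sales forecast

1. Inspect the data
2. Preprocessing
3. Remove outliers
4. Encoding
5. Generate features
6. Generate lag features
7. Other features
8. Implement the model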

7. The code in detail

Importing libraries

# Imports
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

# Cartesian product and label encoding helpers
from itertools import product
from sklearn.preprocessing import LabelEncoder

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# XGBoost
from xgboost import XGBRegressor
from xgboost import plot_importance

def plot_features(booster, figsize):    
    fig, ax = plt.subplots(1,1,figsize=figsize)
    return plot_importance(booster=booster, ax=ax)

import time
import sys
import gc
import pickle
sys.version_info

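Loading and combining the data

The following provided files are used.

File name	Contents
sales_train.csv	Training data
test.csv	Test data
sample_submission.csv	Sample submission file
items.csv	Item master
item_categories.csv	Item category data
shops.csv	Shop master

The model is trained on sales_train.csv, and predictions are made for the pairs in test.csv in the format of sample_submission.csv.
The other files serve as reference for understanding the data and, later, as sources of categorical features.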

# Load the data
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
cats = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
# Set ID as the index so that it is not lost later.
test  = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv').set_index('ID')
print(items)

plt.figure(figsize=(10,4))
plt.xlim(-100, 3000)
sns.boxplot(x=train.item_cnt_day)

plt.figure(figsize=(10,4))
plt.xlim(train.item_price.min(), train.item_price.max()*1.1)
sns.boxplot(x=train.item_price)

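Fixing the data and removing outliers

The box plots above show that both columns contain outliers.
Rows with train.item_price of 100000 or more, or train.item_cnt_day of 1001 or more, are removed from the training data.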

train = train[train.item_price<100000]
train = train[train.item_cnt_day<1001]

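train.item_price also contains a value below zero, which is an error. It is replaced with the median price of the same item in the same shop and month.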

median = train[(train.shop_id==32)&(train.item_id==2973)&(train.date_block_num==4)&(train.item_price>0)].item_price.median()
train.loc[train.item_price<0, 'item_price'] = median
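
The replacement above can be checked on a toy frame. The prices below are invented for illustration; only the column names and the shop/item/month IDs follow the code above.

```python
import pandas as pd

# Toy data with one invalid (negative) price; the values are made up.
toy = pd.DataFrame({
    "shop_id":        [32, 32, 32, 32, 5],
    "item_id":        [2973, 2973, 2973, 2973, 2973],
    "date_block_num": [4, 4, 4, 4, 4],
    "item_price":     [2499.0, -1.0, 1249.0, 2499.0, 999.0],
})

# Median of the valid prices for the same shop, item and month as the bad row.
same_group = (toy.shop_id == 32) & (toy.item_id == 2973) & (toy.date_block_num == 4)
median = toy.loc[same_group & (toy.item_price > 0), "item_price"].median()

# Overwrite only the negative price and leave every other row untouched.
toy.loc[toy.item_price < 0, "item_price"] = median
print(toy)
```

Using a median from the same shop, item, and month keeps the row while limiting how far the filled-in price can drift from realistic values.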

Some shops are duplicates of each other, so the training and test sets are corrected.

# Unify the IDs of duplicated shops (applied to both train and test).
# Якутск Орджоникидзе, 56
train.loc[train.shop_id == 0, 'shop_id'] = 57
test.loc[test.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
train.loc[train.shop_id == 1, 'shop_id'] = 58
test.loc[test.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
train.loc[train.shop_id == 10, 'shop_id'] = 11
test.loc[test.shop_id == 10, 'shop_id'] = 11

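Preprocessing shops/cats/items

Looking at the shops data, the following can be observed.
Each shop name (shop_name) starts with the name of a Russian city.
shop_name has a structure such as [city name] [shop type] "[shop name]" and always starts with the city name.
All the shops data was checked above, so the data check is omitted here.
The preprocessing starts with the shops data.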

shops.loc[shops.shop_name == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
shops['city'] = shops['shop_name'].str.split(' ').map(lambda x: x[0])
shops.loc[shops.city == '!Якутск', 'city'] = 'Якутск'
shops['city_code'] = LabelEncoder().fit_transform(shops['city'])
shops = shops[['shop_id','city_code']]

cats['split'] = cats['item_category_name'].str.split('-')
cats['type'] = cats['split'].map(lambda x: x[0].strip())
cats['type_code'] = LabelEncoder().fit_transform(cats['type'])

cats['subtype'] = cats['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
cats['subtype_code'] = LabelEncoder().fit_transform(cats['subtype'])
cats = cats[['item_category_id','type_code', 'subtype_code']]

items.drop(['item_name'], axis=1, inplace=True)
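
The city extraction and label encoding applied to shops above can be tried on a toy frame. The shop names here are invented English stand-ins, and pandas category codes are swapped in for LabelEncoder so that the sketch needs only pandas; both assign integer codes in sorted order of the labels.

```python
import pandas as pd

# Invented shop names; the first word plays the role of the city.
toy_shops = pd.DataFrame({
    "shop_id": [0, 1, 2, 3],
    "shop_name": ['!Yakutsk Mall "A"', "Yakutsk Store", 'Moscow Mall "B"', "Kazan Outlet"],
})

# The first whitespace-separated token is the city name.
toy_shops["city"] = toy_shops["shop_name"].str.split(" ").str[0]
# Repair a city spelled with a stray leading character, as done for '!Якутск' above.
toy_shops.loc[toy_shops.city == "!Yakutsk", "city"] = "Yakutsk"
# Categories are sorted, so the codes follow alphabetical order of the city names.
toy_shops["city_code"] = toy_shops["city"].astype("category").cat.codes
print(toy_shops[["shop_id", "city", "city_code"]])
```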

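Monthly sales

test.csv consists of item ID/shop ID pairs for which the monthly sales of November 2015 are to be predicted.
The number of pairs is 5100 items * 42 shops = 214200.
There are 363 items that appear in test but not in train.

These items have no sales history, so the target (monthly sales here) cannot be learned for them and is expected to be 0.
The training set, on the other hand, contains only pairs that were sold or returned in the past.
The idea is therefore to compute monthly sales and extend them with zero sales for every shop/item pair within each month.
That way the training data resembles the test data.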

len(list(set(test.item_id) - set(test.item_id).intersection(set(train.item_id)))), len(list(set(test.item_id))), len(test)

ts = time.time()
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):
    sales = train[train.date_block_num==i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))
    
matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8)
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols,inplace=True)
time.time() - ts
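
The zero-extension idea can be seen on a toy example. The sketch below uses invented sales rows; it builds every shop/item combination per month and fills pairs without sales with 0, which is a simplified version of what the matrix code above and the later merge do.

```python
from itertools import product

import pandas as pd

# Invented daily sales: two shops and two items in month 0, one pair in month 1.
sales = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id":        [1, 2, 1],
    "item_id":        [10, 11, 10],
    "item_cnt_day":   [3, 1, 2],
})
cols = ["date_block_num", "shop_id", "item_id"]

# Every shop seen in a month is paired with every item seen in that month.
rows = []
for month, month_sales in sales.groupby("date_block_num"):
    rows.extend(product([month], month_sales.shop_id.unique(), month_sales.item_id.unique()))
grid = pd.DataFrame(rows, columns=cols).astype("int64")

# Monthly totals for the pairs that actually sold.
monthly = sales.groupby(cols, as_index=False)["item_cnt_day"].sum()
monthly = monthly.rename(columns={"item_cnt_day": "item_cnt_month"})

# Pairs with no sales get an explicit zero.
grid = grid.merge(monthly, on=cols, how="left").fillna({"item_cnt_month": 0})
print(grid)
```

Month 0 yields 2 x 2 = 4 rows even though only two pairs sold, so the model also sees examples of "nothing sold".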

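A matrix is created as the product of the shop/item pairs for each month of the training set.
The training set is aggregated by shop/item pair to compute the monthly target, which is then clipped to the range (0, 20). This brings the training target closer to the test predictions.
item_cnt_month is stored as a float rather than an int so that its type does not change when the test set is appended later. An int16 column becomes float64 once NaN values are appended, whereas a float16 column stays float16 even with NaN.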
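
The dtype behaviour can be checked directly. The sketch below uses two tiny invented frames, one of which lacks the target column, as test.csv does. The integer result is certain because integers cannot represent NaN; the float16 result is printed rather than asserted here, since it may depend on the pandas version.

```python
import numpy as np
import pandas as pd

train_part = pd.DataFrame({"item_id": [1, 2]})
test_part = pd.DataFrame({"item_id": [3]})  # no target column, like test.csv

as_int = train_part.assign(item_cnt_month=np.array([0, 5], dtype=np.int16))
as_float = train_part.assign(item_cnt_month=np.array([0, 5], dtype=np.float16))

# Appending rows without the target column introduces NaN in that column.
int_result = pd.concat([as_int, test_part], ignore_index=True)
float_result = pd.concat([as_float, test_part], ignore_index=True)

# An integer column cannot hold NaN, so pandas upcasts it to a float type.
print(int_result["item_cnt_month"].dtype)
# A float column can already hold NaN.
print(float_result["item_cnt_month"].dtype)
```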

# Add revenue (price times quantity for each row) to train.
train['revenue'] = train['item_price'] *  train['item_cnt_day']

Clip item_cnt_month to (0, 20).

ts = time.time()
group = train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['sum']})
group.columns = ['item_cnt_month']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=cols, how='left')
matrix['item_cnt_month'] = (matrix['item_cnt_month']
                                .fillna(0)
                                .clip(0,20) 
                                .astype(np.float16))
time.time() - ts

Test set
Append test to the matrix as month 34 and fill the resulting NaN values with zero.

test['date_block_num'] = 34
test['date_block_num'] = test['date_block_num'].astype(np.int8)
test['shop_id'] = test['shop_id'].astype(np.int8)
test['item_id'] = test['item_id'].astype(np.int16)

ts = time.time()
matrix = pd.concat([matrix, test], ignore_index=True, sort=False)
matrix.fillna(0, inplace=True) # month 34 has no target yet
time.time() - ts

Generating features

Merge shops/items/cats into the matrix.

ts = time.time()
matrix = pd.merge(matrix, shops, on=['shop_id'], how='left')
matrix = pd.merge(matrix, items, on=['item_id'], how='left')
matrix = pd.merge(matrix, cats, on=['item_category_id'], how='left')
matrix['city_code'] = matrix['city_code'].astype(np.int8)
matrix['item_category_id'] = matrix['item_category_id'].astype(np.int8)
matrix['type_code'] = matrix['type_code'].astype(np.int8)
matrix['subtype_code'] = matrix['subtype_code'].astype(np.int8)
time.time() - ts

Add lag features of the target.

def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df

ts = time.time()
matrix = lag_feature(matrix, [1,2,3,6,12], 'item_cnt_month')
time.time() - ts
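
How lag_feature works is easiest to see on a toy frame with a single shop/item pair. The function below repeats the logic defined above with only cosmetic changes; the sales numbers are invented.

```python
import pandas as pd


def lag_feature(df, lags, col):
    keys = ["date_block_num", "shop_id", "item_id"]
    tmp = df[keys + [col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = keys + [f"{col}_lag_{i}"]
        # Relabel month m as month m + i, so that after the merge
        # each row sees the value from i months earlier.
        shifted["date_block_num"] += i
        df = pd.merge(df, shifted, on=keys, how="left")
    return df


toy = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id":        [1, 1, 1],
    "item_id":        [10, 10, 10],
    "item_cnt_month": [4.0, 7.0, 2.0],
})
out = lag_feature(toy, [1, 2], "item_cnt_month")
print(out)
```

Rows whose earlier month does not exist get NaN, which is why the first 12 months are cut later when a lag of 12 is used.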

Add mean-encoded features

ts = time.time()
group = matrix.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num'], how='left')
matrix['date_avg_item_cnt'] = matrix['date_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_avg_item_cnt')
matrix.drop(['date_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_cnt'] = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_avg_item_cnt')
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_shop_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_avg_item_cnt'] = matrix['date_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_shop_avg_item_cnt')
matrix.drop(['date_shop_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_cat_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_category_id'], how='left')
matrix['date_cat_avg_item_cnt'] = matrix['date_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_cat_avg_item_cnt')
matrix.drop(['date_cat_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_cat_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'item_category_id'], how='left')
matrix['date_shop_cat_avg_item_cnt'] = matrix['date_shop_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_cat_avg_item_cnt')
matrix.drop(['date_shop_cat_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id', 'type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_type_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'type_code'], how='left')
matrix['date_shop_type_avg_item_cnt'] = matrix['date_shop_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_type_avg_item_cnt')
matrix.drop(['date_shop_type_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_subtype_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'subtype_code'], how='left')
matrix['date_shop_subtype_avg_item_cnt'] = matrix['date_shop_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_subtype_avg_item_cnt')
matrix.drop(['date_shop_subtype_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'city_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_city_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'city_code'], how='left')
matrix['date_city_avg_item_cnt'] = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_city_avg_item_cnt')
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'item_id', 'city_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_city_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'city_code'], how='left')
matrix['date_item_city_avg_item_cnt'] = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_item_city_avg_item_cnt')
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'type_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_type_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'type_code'], how='left')
matrix['date_type_avg_item_cnt'] = matrix['date_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_type_avg_item_cnt')
matrix.drop(['date_type_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

ts = time.time()
group = matrix.groupby(['date_block_num', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_subtype_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'subtype_code'], how='left')
matrix['date_subtype_avg_item_cnt'] = matrix['date_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_subtype_avg_item_cnt')
matrix.drop(['date_subtype_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts
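
The blocks above all follow the same pattern: group mean, merge, lag, drop. A possible way to factor it out is sketched below. The helper name add_lagged_group_mean is hypothetical, and it is a simplified variant: it shifts the group mean directly instead of going through lag_feature, so results can differ from the code above when a shop/item pair is absent from the earlier month.

```python
import pandas as pd


def add_lagged_group_mean(df, group_cols, name, lag=1):
    """Hypothetical helper: mean of item_cnt_month per group, shifted by `lag` months.

    group_cols must include 'date_block_num'.
    """
    means = (df.groupby(group_cols, as_index=False)["item_cnt_month"].mean()
               .rename(columns={"item_cnt_month": f"{name}_lag_{lag}"}))
    means["date_block_num"] += lag  # month m's mean becomes visible to month m + lag
    return df.merge(means, on=group_cols, how="left")


# Invented data: one item sold in two shops over two months.
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "item_id":        [10, 10, 10, 10],
    "shop_id":        [1, 2, 1, 2],
    "item_cnt_month": [2.0, 4.0, 1.0, 0.0],
})
out = add_lagged_group_mean(toy, ["date_block_num", "item_id"], "date_item_avg_item_cnt")
print(out)
```

Only the lagged mean is kept because the mean of the current month is computed from the target itself and would leak it into the features.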

Add trend features.
Price trend over the last six months.

ts = time.time()
group = train.groupby(['item_id']).agg({'item_price': ['mean']})
group.columns = ['item_avg_item_price']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['item_id'], how='left')
matrix['item_avg_item_price'] = matrix['item_avg_item_price'].astype(np.float16)

group = train.groupby(['date_block_num','item_id']).agg({'item_price': ['mean']})
group.columns = ['date_item_avg_item_price']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_price'] = matrix['date_item_avg_item_price'].astype(np.float16)

lags = [1,2,3,4,5,6]
matrix = lag_feature(matrix, lags, 'date_item_avg_item_price')

for i in lags:
    matrix['delta_price_lag_'+str(i)] = \
        (matrix['date_item_avg_item_price_lag_'+str(i)] - matrix['item_avg_item_price']) / matrix['item_avg_item_price']

def select_trend(row):
    for i in lags:
        if row['delta_price_lag_'+str(i)]:
            return row['delta_price_lag_'+str(i)]
    return 0
    
matrix['delta_price_lag'] = matrix.apply(select_trend, axis=1)
matrix['delta_price_lag'] = matrix['delta_price_lag'].astype(np.float16)
matrix['delta_price_lag'] = matrix['delta_price_lag'].fillna(0)

features_to_drop = ['item_avg_item_price', 'date_item_avg_item_price']
for i in lags:
    features_to_drop += ['date_item_avg_item_price_lag_'+str(i)]
    features_to_drop += ['delta_price_lag_'+str(i)]

matrix.drop(features_to_drop, axis=1, inplace=True)

time.time() - ts
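
One detail of select_trend is worth noting: it relies on truthiness, and NaN is truthy in Python. A missing lag-1 delta is therefore returned as NaN and later filled with 0, instead of falling through to an older lag. If the intent is to take the most recent known price change, a NaN-aware variant might look like the sketch below. This is an assumption about the intent, not the notebook's code, and the function name is hypothetical.

```python
import math


def first_known_delta(deltas):
    """Return the first lagged price delta that is neither missing nor zero, else 0."""
    for d in deltas:
        if d is not None and not math.isnan(d) and d != 0:
            return d
    return 0.0


nan = float("nan")
print(bool(nan))  # NaN counts as truthy, unlike 0.0
# Deltas ordered from lag 1 to lag 4: lag 1 is missing and lag 2 is zero.
print(first_known_delta([nan, 0.0, -0.25, 0.1]))
```

Switching to such a variant would change the delta_price_lag feature, so validation scores and feature importance would need to be checked again.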

Shop revenue trend over the last month

ts = time.time()
group = train.groupby(['date_block_num','shop_id']).agg({'revenue': ['sum']})
group.columns = ['date_shop_revenue']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_revenue'] = matrix['date_shop_revenue'].astype(np.float32)

group = group.groupby(['shop_id']).agg({'date_shop_revenue': ['mean']})
group.columns = ['shop_avg_revenue']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['shop_id'], how='left')
matrix['shop_avg_revenue'] = matrix['shop_avg_revenue'].astype(np.float32)

matrix['delta_revenue'] = (matrix['date_shop_revenue'] - matrix['shop_avg_revenue']) / matrix['shop_avg_revenue']
matrix['delta_revenue'] = matrix['delta_revenue'].astype(np.float16)

matrix = lag_feature(matrix, [1], 'delta_revenue')

matrix.drop(['date_shop_revenue','shop_avg_revenue','delta_revenue'], axis=1, inplace=True)
time.time() - ts

matrix['month'] = matrix['date_block_num'] % 12

Number of days in each month. There are no leap years in the data period.

days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
matrix['days'] = matrix['month'].map(days).astype(np.int8)
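
The fixed table of days can be cross-checked against the calendar. The sketch assumes that date_block_num 0 is January 2013, which follows from block 34 being November 2015; the function name is hypothetical.

```python
import calendar


def days_in_block(date_block_num, start_year=2013):
    # Assumption: block 0 is January of start_year and each block is one month.
    year = start_year + date_block_num // 12
    month = date_block_num % 12 + 1
    return calendar.monthrange(year, month)[1]


fixed_table = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
# Blocks 0 to 34 cover January 2013 to November 2015.
mismatches = [b for b in range(35) if days_in_block(b) != fixed_table[b % 12]]
print(mismatches)
```

An empty list confirms that no February in the period has 29 days, so the fixed table is safe for this dataset.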

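Months since the last sale, for each shop/item pair and for each item alone.
Create a hash table whose key is {shop_id, item_id} and whose value is date_block_num.
Iterate over the data from the top. For each row, if {row.shop_id, row.item_id} is not in the table, add it with row.date_block_num as its value. If the key is already in the table, compute the difference between the cached value and row.date_block_num.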

ts = time.time()
cache = {}
matrix['item_shop_last_sale'] = -1
matrix['item_shop_last_sale'] = matrix['item_shop_last_sale'].astype(np.int8)
for idx, row in matrix.iterrows():    
    key = str(row.item_id)+' '+str(row.shop_id)
    if key not in cache:
        if row.item_cnt_month!=0:
            cache[key] = row.date_block_num
    else:
        last_date_block_num = cache[key]
        matrix.at[idx, 'item_shop_last_sale'] = row.date_block_num - last_date_block_num
        cache[key] = row.date_block_num         
time.time() - ts

ts = time.time()
cache = {}
matrix['item_last_sale'] = -1
matrix['item_last_sale'] = matrix['item_last_sale'].astype(np.int8)
for idx, row in matrix.iterrows():    
    key = row.item_id
    if key not in cache:
        if row.item_cnt_month!=0:
            cache[key] = row.date_block_num
    else:
        last_date_block_num = cache[key]
        if row.date_block_num>last_date_block_num:
            matrix.at[idx, 'item_last_sale'] = row.date_block_num - last_date_block_num
            cache[key] = row.date_block_num         
time.time() - ts
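
The hash-table idea can be tried in plain Python on a handful of rows. The sketch below is a variant, not a copy of the loops above: it refreshes the cached month only when a sale occurred, whereas the shop/item loop above refreshes it on every row once the key is present. The function name and rows are invented.

```python
def months_since_last_sale(rows):
    """rows: (date_block_num, shop_id, item_id, item_cnt_month), sorted by month.

    Returns -1 where the pair has no earlier sale.
    """
    last_sale = {}  # (shop_id, item_id) -> month of the most recent sale
    out = []
    for month, shop_id, item_id, cnt in rows:
        key = (shop_id, item_id)
        out.append(month - last_sale[key] if key in last_sale else -1)
        if cnt != 0:
            last_sale[key] = month  # remember only months with an actual sale
    return out


rows = [
    (0, 1, 10, 0),  # no sale yet
    (1, 1, 10, 3),  # first sale
    (2, 1, 10, 0),
    (3, 1, 10, 0),
    (3, 2, 10, 1),  # different shop, no earlier sale
]
print(months_since_last_sale(rows))
```

The value for a row uses only earlier months, so the feature does not leak the current month's target.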

Months since the first sale, for each shop/item pair and for each item alone.

ts = time.time()
matrix['item_shop_first_sale'] = matrix['date_block_num'] - matrix.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
matrix['item_first_sale'] = matrix['date_block_num'] - matrix.groupby('item_id')['date_block_num'].transform('min')
time.time() - ts

Final preparation

Because a lag of 12 is used, the first year is cut, and all columns that cannot be computed for the test set are dropped.

ts = time.time()
matrix = matrix[matrix.date_block_num > 11]
time.time() - ts

ts = time.time()
def fill_na(df):
    # Fill missing lagged sales counts with 0 (no record for that earlier month).
    for col in df.columns:
        if '_lag_' in col and 'item_cnt' in col and df[col].isnull().any():
            df[col] = df[col].fillna(0)
    return df

matrix = fill_na(matrix)
time.time() - ts

matrix.columns

The dataset is complete

matrix.to_pickle('data.pkl')
del matrix
del cache
del group
del items
del shops
del cats
del train
# test is kept because it is needed to build the submission.
gc.collect();

Building the XGBoost model

data = pd.read_pickle('data.pkl')

data = data[[
    'date_block_num',
    'shop_id',
    'item_id',
    'item_cnt_month',
    'city_code',
    'item_category_id',
    'type_code',
    'subtype_code',
    'item_cnt_month_lag_1',
    'item_cnt_month_lag_2',
    'item_cnt_month_lag_3',
    'item_cnt_month_lag_6',
    'item_cnt_month_lag_12',
    'date_avg_item_cnt_lag_1',
    'date_item_avg_item_cnt_lag_1',
    'date_item_avg_item_cnt_lag_2',
    'date_item_avg_item_cnt_lag_3',
    'date_item_avg_item_cnt_lag_6',
    'date_item_avg_item_cnt_lag_12',
    'date_shop_avg_item_cnt_lag_1',
    'date_shop_avg_item_cnt_lag_2',
    'date_shop_avg_item_cnt_lag_3',
    'date_shop_avg_item_cnt_lag_6',
    'date_shop_avg_item_cnt_lag_12',
    'date_cat_avg_item_cnt_lag_1',
    'date_shop_cat_avg_item_cnt_lag_1',
    #'date_shop_type_avg_item_cnt_lag_1',
    #'date_shop_subtype_avg_item_cnt_lag_1',
    'date_city_avg_item_cnt_lag_1',
    'date_item_city_avg_item_cnt_lag_1',
    #'date_type_avg_item_cnt_lag_1',
    #'date_subtype_avg_item_cnt_lag_1',
    'delta_price_lag',
    'month',
    'days',
    'item_shop_last_sale',
    'item_last_sale',
    'item_shop_first_sale',
    'item_first_sale',
]]

The validation strategy uses date_block_num 34 as the test set, 33 as the validation set, and 12 to 32 as the training set.

# Training data
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
# Validation data
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
# Test data
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

del data
gc.collect();

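Machine learning is run on the features created above. The algorithm is XGBoost, with the hyperparameters shown below. Months with date_block_num 12 to 32 are used for training, and month 33 is the validation data.
Note: the first 12 months are not used as training data because they lack lag information (such as sales 12 months earlier).

early_stopping_rounds means that training stops when the validation score fails to improve for 10 consecutive rounds.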
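
The stopping rule can be sketched in plain Python. This is an illustration of the idea only, not XGBoost's implementation; the function name and the scores are invented, and a smaller patience is used to keep the example short.

```python
def early_stopping_round(valid_scores, patience=10):
    """Return (stop_round, best_round) for per-round validation RMSE values.

    Training stops once `patience` rounds have passed without a new best score.
    """
    best_score, best_round = float("inf"), -1
    for rnd, score in enumerate(valid_scores):
        if score < best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return rnd, best_round
    return len(valid_scores) - 1, best_round


# Validation RMSE improves for three rounds and then gets worse.
scores = [1.20, 1.10, 1.05, 1.06, 1.07, 1.08]
print(early_stopping_round(scores, patience=3))
```

Stopping on the validation score rather than the training score limits overfitting, because the training error usually keeps falling as trees are added.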

ts = time.time()

model = XGBRegressor(
    max_depth=8,
    n_estimators=1000,
    min_child_weight=300, 
    colsample_bytree=0.8, 
    subsample=0.8, 
    eta=0.3,    
    seed=42)

model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds = 10)

time.time() - ts

Output from model training

Y_pred = model.predict(X_valid).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# Save the predictions for later ensembling.
with open('xgb_train.pickle', 'wb') as f:
    pickle.dump(Y_pred, f)
with open('xgb_test.pickle', 'wb') as f:
    pickle.dump(Y_test, f)

plot_features(model, (10,14))

Next, feature_importance is visualized.
The plot produced by plot_features shows which features mattered most in this training run. delta_price_lag had the largest effect.
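
Besides feature importance, the validation predictions can be scored by hand with RMSE, the metric passed to the model above. The sketch below uses small invented arrays in place of Y_valid and the model output, and shows why clipping predictions to the same (0, 20) range as the target can help.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, computed in float64 to avoid float16 rounding."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Invented targets (already within 0..20) and raw predictions.
y_valid = np.array([0.0, 1.0, 20.0, 3.0])
raw_pred = np.array([-0.5, 1.0, 24.0, 3.0])
clipped = raw_pred.clip(0, 20)

print(rmse(y_valid, raw_pred), rmse(y_valid, clipped))
```

Because the target never leaves the (0, 20) range, a prediction outside it can only add error, so clipping never makes the score worse.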

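8. Discussion

Viewed as a time series, sales show a downward trend, and the number of items sold rises in a similar way every December, which confirms seasonality. Adding such points as variables would likely improve accuracy.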
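
One way to express the December effect as a variable is an explicit flag. The sketch below assumes that date_block_num 0 is January 2013, so a remainder of 11 means December; the function name is hypothetical. The existing month feature already carries the same information, so whether an extra flag helps would have to be checked on the validation month.

```python
def is_december(date_block_num):
    # Assumption: block 0 is January 2013, so block % 12 == 11 is December.
    return int(date_block_num % 12 == 11)


# Blocks 0 to 34 cover the training months plus the test month.
flags = [is_december(b) for b in range(35)]
december_blocks = [b for b, flag in enumerate(flags) if flag]
print(december_blocks)
```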

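9. Conclusion

This time I tried forecasting product sales with machine learning. XGBoost is a very popular algorithm in competitions such as Kaggle, but understanding what the code means was very difficult.

Going forward, I would like to try other accurate models based on the same GBDT algorithm, such as LightGBM and CatBoost, and predict with the average of the three models.
Finally, the way of learning machine learning that I recommend is to study in an environment where you can ask machine learning engineers questions at any time. Thank you for reading.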

10. References

・Predict Future Sales (Kaggle data analysis competition)
・XGBoost (explanatory blog post on Qiita)
・GBDT (explanatory blog post on Qiita)
