
冨井さん's Program 2

Program

Last updated at 2020-02-21. Posted at 2020-02-21.

---------------------------------

Data preparation

----------------------------------

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(font="IPAexGothic",style="white")

pandas display settings

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

train_x is the training data, train_y is the target variable, and test_x is the test data.

They are held as pandas DataFrame and Series objects (sometimes as numpy arrays).

# train = pd.read_csv('./train2.csv',encoding="SHIFT-JIS")
# test = pd.read_csv('./test2.csv',encoding="SHIFT-JIS")

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

train["t"] = 1
test["t"] = 0

Combine the training and test data

dat = pd.concat([train,test],sort=True).reset_index(drop=True)
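The `t` flag makes it possible to split the combined frame back into training and test rows once feature engineering is done. The following is a minimal sketch with invented rows; the toy values and the target column name `y` are assumptions for illustration, not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values invented for illustration)
train_toy = pd.DataFrame({"datetime": ["2013-11-18", "2013-11-19"], "y": [90, 101]})
test_toy = pd.DataFrame({"datetime": ["2014-10-01"]})

train_toy["t"] = 1  # 1 = training row
test_toy["t"] = 0   # 0 = test row

# Columns missing on one side (here "y" for the test row) become NaN
dat_toy = pd.concat([train_toy, test_toy], sort=True).reset_index(drop=True)

# ... feature engineering on dat_toy would go here ...

# Split back using the flag
train_back = dat_toy[dat_toy["t"] == 1].drop(columns="t")
test_back = dat_toy[dat_toy["t"] == 0].drop(columns=["t", "y"])
```

Processing both sets in one frame keeps derived columns identical on both sides.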

Re-index in date order

dat.index = pd.to_datetime(dat["datetime"])
# sort_index() is what actually puts the rows in date order before renumbering
dat = dat.sort_index().reset_index(drop=True)

---------------------------------

Initial data check

----------------------------------

1. Summary statistics

Training data

train.describe()

Test data

test.describe()

Training + test data

dat.describe()

2. Missing counts and data types

Training data

train.info()

Test data

test.info()

Training + test data

dat.info()

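3. Missing values and preprocessing policy

event column

dat['event'].unique()

(Policy) For the event column, create one-hot columns for 「ママの会」 (mothers' gathering) and 「キャリアアップ支援セミナー」 (career advancement support seminar), then drop the event column itself.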

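kcal column

dat['kcal'].unique()

Check when kcal is missing

dat.loc[dat['kcal'].isnull() , ['datetime', 'kcal','t','name','remarks']]

What the kcal data shows

・From 2013/11/18 (the first day of the training data) through 2013/12/26 there is no kcal value.

　⇒ kcal recording appears to have started in 2014.

・From 2014 onward there are still missing values in both the training and test data, but only when remarks is お楽しみメニュー or スペシャルメニュー(800円).

(Policy) Fill missing kcal values with the mean of the non-missing kcal values.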

payday column

dat['payday'].unique()

(Policy) Fill missing payday values with 0.

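remarks column

dat['remarks'].unique()

Check the rows where remarks is set

dat.loc[dat['remarks'].notnull() , ['datetime','remarks','t','name','kcal']]

(Policy) For the remarks column, create one-hot columns for 「お楽しみメニュー」, 「料理長のこだわりメニュー」, and 「近隣に飲食店複合ビルオープン」, then drop the remarks column itself.

・The 2014-1-24 entry 「鶏のレモンペッパー焼（50食）、カレー（42食）」 and the 2014-2-21 entry 「酢豚（28食）、カレー（85食）」 are days on which two kinds of bento were sold. This pattern does not occur in the test data, so these records are dropped.

・The 2014-9-26 entry 「スペシャルメニュー（800円）」 has the menu name 「ランチビュッフェ」 (lunch buffet), so it is judged not to be a bento and the record is dropped.

・The 2014-9-2 entry 「手作りの味」 does not appear in the test data, so it is not kept as a feature.

・The 2014-11-13 entry 「近隣に飲食店複合ビルオープン」 (a restaurant complex opened nearby) means a competitor to the bento shop has appeared. However, this remark occurs only in the test data. The effect is likely to continue from 2014-11-14 onward, so a feature for the following days is created as well. Because these features cannot influence model training, the predictions are corrected after inference instead (trying an estimated -5 to -10 meals).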
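The post-inference correction described above could look roughly like the sketch below. The helper name `adjust_for_rival`, the sample predictions, and the flat penalty of 5 meals are all hypothetical; the article only states that a correction of about -5 to -10 meals is tried.

```python
def adjust_for_rival(predictions, rival_flags, penalty=5):
    """Subtract `penalty` meals from predictions on flagged days.

    `penalty` is a guess within the 5-10 range mentioned in the text,
    not a value fitted from data. Results are clipped at zero.
    """
    return [max(p - penalty * f, 0) for p, f in zip(predictions, rival_flags)]


preds = [60.0, 58.0, 55.0]  # hypothetical model outputs
flags = [0, 1, 1]           # hypothetical flags: 1 from the opening day onward
adjusted = adjust_for_rival(preds, flags, penalty=5)
```

Keeping the penalty as a parameter makes it easy to compare several values in the stated range.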

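weather column

dat['weather'].unique()

precipitation column

dat['precipitation'].unique()

Check combinations of the weather and precipitation columns

dat_weather =dat.groupby(['weather','precipitation'],as_index=False)
dat_weather.mean(numeric_only=True)

A precipitation of 「--」 occurs when weather is 「快晴」 (clear), 「晴れ」 (sunny), 「曇」 (cloudy), or 「薄曇」 (lightly cloudy).

When weather is 「曇」 or 「薄曇」, precipitation sometimes has a value.

(Was the day recorded as cloudy even though it rained? Or was the forecast cloudy the day before or in the morning, while it actually rained at midday?)

The soldout rate appears to be higher when precipitation has a value.

Create the following features from weather and precipitation.

weather is 「快晴」 or 「晴れ」 and precipitation is 「--」 … "weather excellent" (天気優)

weather is 「薄曇」 or 「曇」 and precipitation is 「--」 … "weather fair" (天気可)

weather is 「雨」 (rain), 「薄曇」, or 「曇」 and precipitation < 2 … "weather slightly bad" (天気やや悪)

weather is 「雪」 (snow) or 「雷電」 (thunderstorm), or weather is 「雨」 and precipitation >= 2 … "weather bad" (天気悪)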
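The four rules above can be sketched as a single function. The function name, the English labels, and the order of the checks are assumptions chosen so that every row gets exactly one label, which the separate one-hot columns built later in the article may not strictly guarantee.

```python
def weather_category(weather, precipitation):
    """Classify one day from its weather label and raw precipitation string."""
    # "--" means no precipitation was recorded; map it to -1 as a sentinel
    rain = -1.0 if precipitation == "--" else float(precipitation)

    if weather in ("雪", "雷電") or rain >= 2:
        return "bad"
    if rain >= 0:
        return "slightly_bad"
    if weather in ("快晴", "晴れ"):
        return "excellent"
    if weather in ("曇", "薄曇"):
        return "fair"
    return "unclassified"  # combinations the rules do not cover


examples = [
    weather_category("快晴", "--"),
    weather_category("曇", "--"),
    weather_category("曇", "0.5"),
    weather_category("雨", "2.5"),
    weather_category("雪", "--"),
]
```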

week column

dat['week'].unique()

---------------------------------

Data processing

----------------------------------

1. Filling missing values

payday column: fill nulls with 0

dat['payday'] = dat['payday'].fillna(0)

kcal column: fill nulls with the mean of the non-null values

dat['kcal'] = dat['kcal'].fillna(dat['kcal'].mean())
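As a small check of the mean fill, here is a sketch on an invented series. `Series.mean()` skips NaN by default, so the fill value is the mean of the observed entries only. Note that in the article the mean is taken over the combined training and test rows.

```python
import pandas as pd

# Invented kcal values with two gaps
kcal = pd.Series([400.0, None, 420.0, None])

# mean() ignores the NaN entries, so the fill value is (400 + 420) / 2
filled = kcal.fillna(kcal.mean())
```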

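2. Creating new features

Define days as the number of elapsed days, with 2013-11-18 as day 1

dat['days'] = (pd.to_datetime(dat['datetime'])-pd.to_datetime(dat['datetime'][0])).dt.days+1

Check the result

dat['days'].unique()

Define weeks as the number of elapsed weeks, with the week of 2013-11-18 as week 1

dat['weeks']= ((dat['days']-3) / 7).round()+1

Feature check

dat.loc[dat['datetime'].notnull() , ['datetime','days','weeks']]

Define months as the number of elapsed months, with 2013-11 as month 1.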

w_time = pd.to_datetime(dat['datetime'])
dat['months']= (w_time.dt.year - w_time[0].year)*12 + w_time.dt.month - w_time[0].month + 1
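To see what the three elapsed-time formulas produce, the sketch below applies them to a handful of hand-picked dates (the dates are chosen for illustration; 2013-11-18 is a Monday). Subtracting 3 before rounding centers each week on Wednesday, so Monday through Saturday of the same week get the same number.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(
    ["2013-11-18", "2013-11-22", "2013-11-25", "2013-12-02", "2014-01-06"]
))

# Same formulas as in the article, applied to the toy dates
days = (dates - dates[0]).dt.days + 1
weeks = ((days - 3) / 7).round() + 1
months = (dates.dt.year - dates[0].year) * 12 + dates.dt.month - dates[0].month + 1
```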

event column

# dat['event'].value_counts()

dat['event_mama'] = dat['event'].apply(lambda x: 1 if x=='ママの会' else 0)
dat['event_seminar'] = dat['event'].apply(lambda x: 1 if x=='キャリアアップ支援セミナー' else 0)

remarks column

# dat['remarks'].value_counts()

dat['remarks_fun'] = dat['remarks'].apply(lambda x: 1 if x=='お楽しみメニュー' else 0)
dat['remarks_chef'] = dat['remarks'].apply(lambda x: 1 if x=='料理長のこだわりメニュー' else 0)

# dat['remarks_special'] = dat['remarks'].apply(lambda x: 1 if x=='スペシャルメニュー（800円）' else 0)

# dat['remarks_2types'] = dat['remarks'].apply(lambda x: 1 if x=='鶏のレモンペッパー焼（50食）、カレー（42食）' or x=='酢豚（28食）、カレー（85食）' else 0)

# dat['remarks_handmade'] = dat['remarks'].apply(lambda x: 1 if x=='手作りの味' else 0)

dat['remarks_rival'] = dat['remarks'].apply(lambda x: 1 if x=='近隣に飲食店複合ビルオープン' else 0)

The days following 「近隣に飲食店複合ビルオープン」 are also assumed to be affected, so they are flagged as remarks_rival2

wk_idx= dat.query('remarks_rival==1')
dat['remarks_rival2'] = dat['days'].apply(lambda x: 1 if x>=wk_idx.at[wk_idx.index[0], 'days'] else 0)
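The same "from this day onward" flag can also be written as a vectorized comparison. This is an alternative sketch on an invented four-row frame, not the article's own code; it assumes at least one row has `remarks_rival == 1`.

```python
import pandas as pd

toy = pd.DataFrame({"days": [1, 2, 3, 4], "remarks_rival": [0, 0, 1, 0]})

# First day on which the rival remark appears
first_day = toy.loc[toy["remarks_rival"] == 1, "days"].iloc[0]

# 1 on that day and every later day, 0 before it
toy["remarks_rival2"] = (toy["days"] >= first_day).astype(int)
```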

weather and precipitation columns

dat['precipitation2'] = dat['precipitation'].apply(lambda x: -1 if x=='--' else x)
dat = dat.astype({'precipitation2': float})

# dat['precipitation2'].value_counts()

dat['weather2_優']=dat['weather'].apply(lambda x: 1 if x=='快晴' or x=='晴れ' else 0)
dat['weather2_良']=dat['weather'].apply(lambda x: 1 if x=='曇' or x=='薄曇' else 0)
dat.loc[(dat['precipitation2']>=0.0), 'weather2_良']=0
dat['weather2_やや悪']=dat['precipitation2'].apply(lambda x: 1 if x>=0 and x<2 else 0)
dat['weather2_悪']=dat['precipitation2'].apply(lambda x: 1 if x>=2 else 0)
dat.loc[(dat['weather']=='雪'), ['weather2_やや悪','weather2_悪']]=[0,1]
dat.loc[(dat['weather']=='雷電'), ['weather2_やや悪','weather2_悪']]=[0,1]

Feature check

dat.loc[dat['datetime'].notnull(),['datetime','weather','precipitation2','weather2_優','weather2_良','weather2_やや悪','weather2_悪']]

Create year, month, and day variables

dat['year']=pd.to_datetime(dat['datetime']).dt.year
dat['month']=pd.to_datetime(dat['datetime']).dt.month
dat['day']=pd.to_datetime(dat['datetime']).dt.day

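Flag for days followed by a holiday (the next record is more than one day later)

dat['next_holiday'] = dat['days'].diff(-1).fillna(0).apply(lambda x: 1 if x < -1 else 0)

Flag for the first day after a holiday (the previous record is more than one day earlier)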

dat['first_weekday'] = dat['days'].diff(1).fillna(0).apply(lambda x: 1 if x > 1 else 0)
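A toy run makes it easier to see which row each flag lands on. In the invented sequence below, day 5 is a Friday and day 8 the following Monday, so the gap between them stands for a weekend.

```python
import pandas as pd

days = pd.Series([1, 2, 3, 4, 5, 8, 9])

# diff(-1) is this row minus the next row: below -1 means a gap follows
next_holiday = days.diff(-1).fillna(0).apply(lambda x: 1 if x < -1 else 0)

# diff(1) is this row minus the previous row: above 1 means a gap precedes
first_weekday = days.diff(1).fillna(0).apply(lambda x: 1 if x > 1 else 0)
```

Here `next_holiday` fires on day 5 and `first_weekday` fires on day 8.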

Temperature difference from the previous day

dat['temperature_diff'] = dat['temperature'].diff(1).fillna(0)

The first day of the week is assumed to be unaffected, so its temperature difference is set to 0

dat.loc[(dat['days'].diff(1).fillna(0)>1), 'temperature_diff']=0
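The reset can be checked on a small invented frame. The large jump across the gap between day 5 and day 8 is zeroed, while ordinary day-to-day differences are kept.

```python
import pandas as pd

toy = pd.DataFrame({
    "days": [4, 5, 8, 9],
    "temperature": [20.0, 18.5, 25.0, 24.0],  # invented values
})

toy["temperature_diff"] = toy["temperature"].diff(1).fillna(0)

# Rows that follow a gap in `days` get a difference of 0
toy.loc[toy["days"].diff(1).fillna(0) > 1, "temperature_diff"] = 0
```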

Check temperature_diff

dat[['datetime','temperature_diff']]

Bento type flags

Curry flag

dat['弁当_curry'] = dat['name'].apply(
    lambda x: 1 if 'カレー' in x
    and '鶏肉のカレー唐揚' not in x
    and 'ハンバーグカレーソース' not in x else 0)

Seafood flag

fish_keywords = ['イカ', 'いか', '海老', 'エビ', 'ホタテ', 'カキ', '白身魚', 'さんま',
                 'カレイ', 'サバ', 'メダイ', 'かじき', 'さわら', 'ます', 'サーモン',
                 'アジ', 'キス', 'ぶり', '八宝菜']
# 'マス' counts only when it is not part of 'マスタード'
dat['弁当_fish'] = dat['name'].apply(
    lambda x: 1 if any(k in x for k in fish_keywords)
    or ('マス' in x and 'マスタード' not in x) else 0)

Meat flag

meat_keywords = ['ソーセージ', 'キーマ', 'ハンバーグ', 'メンチ', 'チキン', 'ベーコン', '肉',
                 'ビーフ', 'ヒレカツ', 'ひれかつ', 'ポーク', 'ロース', '牛', '鶏',
                 'ハムカツ', '味噌カツ', '豚', 'ロコモコ', 'ボローニャ風カツ',
                 'スタミナ炒め', 'チャプチェ', 'プルコギ', 'マーボ', 'ミックスグリル',
                 '親子', '筑前煮', '中華丼', '唐揚げ丼', '八宝菜', '麻婆', 'トンカツ']
dat['弁当_meat'] = dat['name'].apply(
    lambda x: 1 if any(k in x for k in meat_keywords) else 0)

Stew flag

dat['弁当_stew'] = dat['name'].apply(lambda x : 1 if x.find('シチュー')>=0 else 0)

Rice bowl flag

dat['弁当_ricebowl'] = dat['name'].apply(lambda x : 1 if x.find('丼')>=0 else 0)

Spicy flag

spicy_keywords = ['辛', 'チリソース', '麻婆', 'マーボ', 'チャプチェ', 'プルコギ',
                  'キムチ', '回鍋肉']
dat['弁当_spicy'] = dat['name'].apply(
    lambda x: 1 if any(k in x for k in spicy_keywords) else 0)
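The menu-type flags all follow one pattern: match any of several keywords in the menu name, optionally with exclusions. A hypothetical helper along these lines could express that pattern once. The name `keyword_flag` and the sample menu name 「ポークカレー」 are assumptions for illustration. The helper reproduces the curry flag's logic, but not the seafood flag's exclusion, which applies to a single keyword only.

```python
def keyword_flag(name, include, exclude=()):
    """Return 1 if any `include` keyword is in `name` and no `exclude` keyword is."""
    hit = any(k in name for k in include)
    blocked = any(k in name for k in exclude)
    return 1 if hit and not blocked else 0


curry_exclude = ["鶏肉のカレー唐揚", "ハンバーグカレーソース"]

plain_curry = keyword_flag("ポークカレー", ["カレー"], curry_exclude)
curry_karaage = keyword_flag("鶏肉のカレー唐揚", ["カレー"], curry_exclude)
```

With pandas this would be applied as `dat['name'].apply(lambda x: keyword_flag(x, [...], [...]))`.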

Convert to dummy variables

dat = pd.get_dummies(dat, columns=["weather","week"])
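For reference, here is what `pd.get_dummies` does to a column, shown on an invented frame (the `y` values are made up). The original column is replaced by one indicator column per distinct value, named `<column>_<value>`. This is why the weather-based features above have to be built before this step.

```python
import pandas as pd

toy = pd.DataFrame({"week": ["月", "火", "月"], "y": [90, 101, 118]})

# "week" is removed and replaced by week_月 and week_火
encoded = pd.get_dummies(toy, columns=["week"])
```

Because the training and test rows are encoded in one frame, both end up with the same set of dummy columns.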

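3. Checks after feature creation

① Missing counts and data types

Training + test data

dat.info()

② Summary statistics

Training data (rows of dat with t == 1, so the new features are included)

dat[dat['t'] == 1].describe()

Test data (rows of dat with t == 0)

dat[dat['t'] == 0].describe()

Training + test data

dat.describe()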

Write out the processed data

dat.to_csv("dat20200124_01.csv",index=False,header=True)
