Regression analysis with Scikit-learn and Chainer

Last updated at 2020-05-03, posted at 2020-04-30

To study Scikit-learn and Chainer, I tried the SIGNATE competition "お弁当の需要予測" (lunch box demand forecasting).

・python3, pandas, numpy, scikit-learn, chainer
・Jupyter Notebook
・Mac

(1) Environment setup

.py

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

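# Import scikit-learn's linear regression model
from sklearn.linear_model import LinearRegression as LR

# Import Chainer
import chainer
# Functions: activations and losses (non-linear transformations)
import chainer.functions as F
# Links: layers with trainable parameters (linear transformations)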
import chainer.links as L

(2) Exploring the data

① Loading the data

Download the train and test data from SIGNATE and load them.

.py

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample = pd.read_csv("sample.csv",header=None)

# Show rows and columns without truncation (the limit here is 500 rows and 500 columns)
# To remove the limit entirely, replace the number with None
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

train.shape
# (207,12)
test.shape
# (40,11)

# Check the data (1): first rows
train.head()

# Check the data (2): column types and non-null counts
train.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 207 entries, 0 to 206
# Data columns (total 12 columns):
#  #   Column         Non-Null Count  Dtype  
# ---  ------         --------------  -----  
# 0   datetime       207 non-null    object 
# 1   y              207 non-null    int64  
# 2   week           207 non-null    object 
# 3   soldout        207 non-null    int64  
# 4   name           207 non-null    object 
# 5   kcal           166 non-null    float64
# 6   remarks        21 non-null     object 
# 7   event          14 non-null     object 
# 8   payday         10 non-null     float64
# 9   weather        207 non-null    object 
# 10  precipitation  207 non-null    object 
# 11  temperature    207 non-null    float64
# dtypes: float64(3), int64(2), object(7)
# memory usage: 19.5+ KB

② Filling missing values

.py

train.isnull().sum()
test.isnull().sum()

train = train.fillna(0)
test = test.fillna(0)
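
As a small illustration of what this step does, the sketch below applies fillna(0) to a hypothetical three-row frame. The values are made up and are not taken from the competition data; only the column names kcal and payday come from the output of train.info() above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame that mimics the missing-value pattern of train.csv
toy = pd.DataFrame({
    "kcal": [420.0, np.nan, 395.0],
    "payday": [np.nan, 1.0, np.nan],
})

missing_before = toy.isnull().sum()   # NaN count per column
filled = toy.fillna(0)                # every NaN becomes 0
missing_after = filled.isnull().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Filling payday with 0 reads naturally as "not a payday". Filling kcal with 0 would treat an unknown calorie count as zero calories, but kcal is not used as an input variable later in this article, so it has no effect here.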

③ Processing an input variable (datetime)

Split datetime into year and month and convert them to int.

.py

train["year"] = train["datetime"].apply(lambda x: x.split("-")[0])
train["month"] = train["datetime"].apply(lambda x: x.split("-")[1])
test["year"] = test["datetime"].apply(lambda x: x.split("-")[0])
test["month"] = test["datetime"].apply(lambda x: x.split("-")[1])

# np.int was removed in NumPy 1.24; the built-in int does the same job
train["year"] = train["year"].astype(int)
train["month"] = train["month"].astype(int)
test["year"] = test["year"].astype(int)
test["month"] = test["month"].astype(int)
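
For comparison, here is a sketch of the same year/month extraction done with pd.to_datetime and the .dt accessor instead of string splitting. The dates are hypothetical and assume the zero-padded "YYYY-MM-DD" text format; the column names year_dt and month_dt are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical dates in "YYYY-MM-DD" text form
toy = pd.DataFrame({"datetime": ["2013-11-18", "2014-01-07", "2014-09-30"]})

# Approach used in the article: split the string and cast to int
toy["year"] = toy["datetime"].apply(lambda s: s.split("-")[0]).astype(int)
toy["month"] = toy["datetime"].apply(lambda s: s.split("-")[1]).astype(int)

# Alternative: parse once, then read the components through the .dt accessor
parsed = pd.to_datetime(toy["datetime"])
toy["year_dt"] = parsed.dt.year
toy["month_dt"] = parsed.dt.month

print(toy)
```

Both routes give the same integers; parsing to a datetime also makes other components, such as the day of the week, available if they are needed later.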

④ Processing an input variable (remarks)

Convert remarks to a number: 1 for "お楽しみメニュー" (the special "fun menu"), 0 otherwise.

.py

def henkan(x):
    if x == "お楽しみメニュー":
        return 1
    else:
        return 0
train["remarks_henkan"] = train["remarks"].apply(lambda x:henkan(x))
# Use test["remarks"] here; taking train["remarks"] would copy training rows into the test features
test["remarks_henkan"] = test["remarks"].apply(lambda x:henkan(x))

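⑤ Processing an input variable (event)

Convert event to a flag: 1 if any event is recorded, 0 otherwise (missing values were already filled with 0 in step ②).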

.py

def henkan2(x):
    if x == 0:
        return 0
    else:
        return 1

train["event_henkan"] = train["event"].apply(lambda x:henkan2(x))
test["event_henkan"] = test["event"].apply(lambda x:henkan2(x))
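
The two helper functions henkan and henkan2 can also be expressed as vectorized comparisons. The following is only a sketch on hypothetical rows; apart from "お楽しみメニュー", the text values are placeholders, and the frame is assumed to have gone through fillna(0) already.

```python
import pandas as pd

# Hypothetical rows, already passed through fillna(0) as in step ②
toy = pd.DataFrame({
    "remarks": [0, "お楽しみメニュー", "some other note"],
    "event": [0, "some event", 0],
})

# 1 only when remarks is exactly "お楽しみメニュー"
toy["remarks_henkan"] = (toy["remarks"] == "お楽しみメニュー").astype(int)
# 1 whenever any event is recorded (anything other than the fill value 0)
toy["event_henkan"] = (toy["event"] != 0).astype(int)

print(toy)
```

The result is the same 0/1 encoding; the comparison form simply avoids calling a Python function once per row.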

⑥ Processing an input variable (week)

.py

# Extract the week column
train_week = train.iloc[:,2]

type(train_week)
# pandas.core.series.Series

# Convert the pandas Series to a pandas DataFrame
train_week = pd.DataFrame(train_week)

# Turn it into dummy variables
train_week = pd.get_dummies(train_week["week"])

# Do the same for test (week is column 1 there because test has no y column)
test_week = test.iloc[:,1]
test_week = pd.DataFrame(test_week)
test_week = pd.get_dummies(test_week["week"])
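
One thing to watch with dummy variables is that train and test only get the same columns if every weekday appears in both. The sketch below uses hypothetical data in which the test sample lacks two weekdays, and aligns the columns with reindex. The names days, train_week_toy, and test_week_toy are introduced only for this illustration.

```python
import pandas as pd

days = ["月", "火", "水", "木", "金"]

# Hypothetical week columns; the test sample happens to lack "水" and "金"
train_week_toy = pd.Series(["月", "火", "水", "木", "金", "月"], name="week")
test_week_toy = pd.Series(["火", "月", "木"], name="week")

# reindex fixes both the column order and any missing weekday (filled with 0)
train_dummies = pd.get_dummies(train_week_toy).reindex(columns=days, fill_value=0).astype(int)
test_dummies = pd.get_dummies(test_week_toy).reindex(columns=days, fill_value=0).astype(int)

print(test_dummies)
```

In the competition data this may well be unnecessary, since step ⑨ selects all five weekday columns from both frames without trouble, but the reindex step makes the assumption explicit.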

⑦ Processing an input variable (temperature)

Bin temperature into 10-degree ranges.

.py

# Check the minimum and maximum
train["temperature"].describe()
# count    207.000000
# mean      19.252174
# std        8.611365
# min        1.200000
# 25%       11.550000
# 50%       19.800000
# 75%       26.100000
# max       34.600000
# Name: temperature, dtype: float64

temperature_bining_trainX = pd.cut(train["temperature"],[0,10,20,30,40])

type(temperature_bining_trainX)
# pandas.core.series.Series

# Convert to a DataFrame
temperature_bining_trainX = pd.DataFrame(temperature_bining_trainX)

Do the same for the test data.

.py

test["temperature"].describe()
temperature_bining_testX = pd.cut(test["temperature"],[0,10,20,30,40])
temperature_bining_testX = pd.DataFrame(temperature_bining_testX)
# The binned column is attached to testX in step ⑨, once testX has been defined in step ⑧

⑧ Defining trainX, testX, and y

.py

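# .copy() makes independent frames, so adding columns later does not trigger SettingWithCopyWarning
trainX = train[["event_henkan","remarks_henkan","year","month","payday"]].copy()
testX = test[["event_henkan","remarks_henkan","year","month","payday"]].copy()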
y = train["y"]

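⑨ Attaching the processed week and temperature columns to trainX and testX

temperature_bining_trainX and temperature_bining_testX have the category dtype, but scikit-learn and Chainer need numeric input for regression, so the bins are converted to int by going through str first, as shown below.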

.py

trainX[["月","火","水","木","金"]] = train_week[["月","火","水","木","金"]]
testX[["月","火","水","木","金"]] = test_week[["月","火","水","木","金"]]

trainX["temperature_bining"] = temperature_bining_trainX["temperature"]
testX["temperature_bining"] = temperature_bining_testX["temperature"]
# category -> str (np.str was removed in NumPy 1.24; use the built-in str)
trainX["temperature_bining"] = trainX["temperature_bining"].astype(str)
testX["temperature_bining"] = testX["temperature_bining"].astype(str)
# str -> int
def henkan_u(x):
    if "(0, 10]" in x:
        return 0
    elif "(10, 20]" in x:
        return 1
    elif "(20, 30]" in x:
        return 2
    else:
        return 3
trainX["temperature_bining"] = trainX["temperature_bining"].apply(lambda x:henkan_u(x))
testX["temperature_bining"] = testX["temperature_bining"].apply(lambda x:henkan_u(x))
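
As a possible shortcut for the category -> str -> int conversion, pd.cut can return the bin index directly when called with labels=False. The sketch below compares that with the string route on hypothetical temperatures; the names temps, edges, codes, and via_text are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical temperatures: one per bin, plus the boundary value 10.0
temps = pd.Series([1.2, 10.0, 19.8, 26.1, 34.6], name="temperature")
edges = [0, 10, 20, 30, 40]

# labels=False returns the integer bin index instead of an Interval category
codes = pd.cut(temps, edges, labels=False)

# Same mapping via the route in the article: Interval -> str -> int
as_text = pd.cut(temps, edges).astype(str)
via_text = as_text.map({"(0, 10]": 0, "(10, 20]": 1, "(20, 30]": 2, "(30, 40]": 3})

print(codes.tolist())
print(via_text.tolist())
```

One difference to keep in mind: a temperature outside the range 0 to 40 becomes NaN with labels=False, whereas the else branch of henkan_u would map it to 3.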

(2) Checking the input variables

Check the column names and the number of rows and columns of trainX and testX.

.py

trainX.columns
# Index(['event_henkan', 'remarks_henkan', 'year', 'month', 'payday', '月', '火','水', '木', '金', 'temperature_bining'],dtype='object')

trainX.shape
# (207, 11)
testX.shape
# (40, 11)
y.shape
# (207,)

(3) Regression with Scikit-learn

.py

model = LR()
model.fit(trainX,y)
result = model.predict(testX)

sample[1] = result
sample.to_csv("submit.csv",header = None,index = None)

After uploading the file to SIGNATE, the score was 13.8053143.
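
Before uploading, it can help to estimate the error locally on held-out rows. The sketch below is a NumPy-only stand-in on synthetic data: it fits ordinary least squares with an intercept (the same kind of model that LinearRegression fits) and computes a root mean squared error on a 30% hold-out. The data, the coefficients, and the choice of RMSE as the metric are all assumptions for illustration; the article does not state which metric SIGNATE uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for (trainX, y): 200 rows, 3 features, known linear relation
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Hold out the last 30% of rows for validation
n_fit = int(len(X) * 0.7)
X_fit, X_val = X[:n_fit], X[n_fit:]
y_fit, y_val = y[:n_fit], y[n_fit:]

# Ordinary least squares with an intercept column appended
A_fit = np.column_stack([X_fit, np.ones(len(X_fit))])
coef, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)

# Predict the held-out rows and measure the error
A_val = np.column_stack([X_val, np.ones(len(X_val))])
pred = A_val @ coef
rmse = float(np.sqrt(np.mean((pred - y_val) ** 2)))

print(coef, rmse)
```

With the real data, the same idea would be to fit on part of trainX and y and compare the predictions for the remaining rows with their known y values.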

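(4) Regression with Chainer

① Preparing the data

The pandas DataFrame has to be converted to a numpy ndarray.
Use the .values attribute to get the ndarray. Chainer works with 32-bit values by default, so the 64-bit data must be converted: astype('f') turns 64-bit floats into float32, and astype('i') turns 64-bit ints into int32. (Note: for regression in Chainer, the input and output variables must have the same dtype or trainer.run raises an error, so both x and t are converted with astype('f').)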

.py

# Convert to a float32 ndarray
x = trainX.values.astype('f')
x.dtype
# dtype('float32')
type(x)
# numpy.ndarray

x.shape
# (207, 11)

Convert y to a float32 array in the same way.

.py

t = y.values
t.dtype
# dtype('int64')
t = y.values.astype('f')
t.dtype
# dtype('float32')
t.shape
# (207,)

The shape (207,) is fine for classification, but for regression trainer.run raises an error unless the shape is (207, 1), that is, an explicit 207 rows by 1 column, so reshape t as follows.

.py

t = t.reshape(len(t),1)
t.shape
# (207, 1)
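
The dtype and shape changes can be tried out in isolation with NumPy. The sketch below uses three hypothetical sales counts. It also shows one reason why matching shapes matter in general: subtracting a (3,) array from a (3, 1) array broadcasts to (3, 3) in NumPy rather than giving three element-wise differences. This is an illustration of the shape issue, not a description of how Chainer checks it internally.

```python
import numpy as np

y_toy = np.array([90, 101, 118], dtype=np.int64)   # hypothetical sales counts

t_toy = y_toy.astype('f')               # 'f' is NumPy's type code for float32
t_col = t_toy.reshape(len(t_toy), 1)    # (3,) -> (3, 1)
t_col2 = t_toy[:, np.newaxis]           # equivalent way to add the second axis

# Mixing the two shapes silently broadcasts to a 3 x 3 matrix
diff = t_col - t_toy

print(t_toy.dtype, t_toy.shape, t_col.shape, diff.shape)
```

Since the network defined later has one output unit, its output has shape (batch size, 1), and the targets are given the same layout.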

Convert testX as well.

.py

tx = testX.values.astype('f')

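② Preparing the dataset

Convert the data into the dataset format that Chainer uses.
To use x (the input variables) and t (the output variable, i.e. the training targets) in Chainer, pair them up as tuples with zip and turn the result into a list. Each tuple is ordered (input variables, output variable).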

.py

dataset = list(zip(x,t))
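
To see what list(zip(x, t)) produces, here is a sketch with small hypothetical arrays (3 samples, 2 features). The names x_toy, t_toy, and dataset_toy are introduced only for this illustration.

```python
import numpy as np

x_toy = np.arange(6, dtype='f').reshape(3, 2)            # 3 samples, 2 features
t_toy = np.array([[10.0], [20.0], [30.0]], dtype='f')    # 3 targets, shape (3, 1)

# zip pairs row i of x_toy with row i of t_toy
dataset_toy = list(zip(x_toy, t_toy))
first_x, first_t = dataset_toy[0]

print(len(dataset_toy), first_x, first_t)
```

Each element is one (input, target) pair, so indexing the list returns a single training example.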

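③ Splitting into training and validation data

Split the dataset: 70% for training and 30% for validation. The number of training samples is cast to int so that it is a whole number. Splitting the rows in their original order would bias the data, so the split is randomized with split_dataset_random, and a seed is set for reproducibility.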

.py

len(dataset)
# 207
n_train = int(len(dataset)*0.7)
n_train
# 144
train,test = chainer.datasets.split_dataset_random(dataset,n_train,seed=0)
len(train)
# 144
len(test)
# 63
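
As a rough picture of what a seeded random split does, the NumPy sketch below shuffles the row indices with a fixed seed and cuts them at the 70% mark. This is an illustration of the idea only, not Chainer's implementation, and the index names are made up.

```python
import numpy as np

n_total = 207
n_train = int(n_total * 0.7)         # 144

rng = np.random.RandomState(0)       # fixed seed -> same shuffle every run
order = rng.permutation(n_total)     # shuffled row indices 0..206

train_idx = order[:n_train]          # first 144 shuffled indices
valid_idx = order[n_train:]          # remaining 63

print(len(train_idx), len(valid_idx))
```

Every row ends up in exactly one of the two groups, and rerunning with the same seed reproduces the same split.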

④ Defining the model

Create the class as follows.

.py

class NN(chainer.Chain):
    def __init__(self,n_mid_units1=5,n_mid_units2=3,n_out=1):
        super().__init__()
        with self.init_scope():
            self.fc1 = L.Linear(None,n_mid_units1)
            self.fc2 = L.Linear(None,n_mid_units2)
            self.fc3 = L.Linear(None,n_out)
            
            self.bn = L.BatchNormalization(11)
    
    def __call__(self,x):
        h = self.bn(x)
        h = self.fc1(h)
        h = F.relu(h)
        h = self.fc2(h)
        h = F.relu(h)
        h = self.fc3(h)
        return h
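
To make the data flow through the network easier to follow, here is a NumPy sketch of the same sequence of steps (normalize, fc1, ReLU, fc2, ReLU, fc3) with random weights. It is meant only to show how the shapes change; the helper names linear and relu are made up, the normalization is a rough stand-in for batch normalization, and nothing here is trained.

```python
import numpy as np

rng = np.random.RandomState(0)

def linear(h, n_out):
    """Fully connected layer with random weights (illustration only)."""
    w = rng.normal(scale=0.1, size=(h.shape[1], n_out)).astype('f')
    b = np.zeros(n_out, dtype='f')
    return h @ w + b

def relu(h):
    """Replace negative values with 0."""
    return np.maximum(h, 0)

x_toy = rng.normal(size=(10, 11)).astype('f')   # one mini-batch: 10 rows, 11 features

# Rough stand-in for batch normalization: standardize each feature over the batch
h0 = (x_toy - x_toy.mean(axis=0)) / (x_toy.std(axis=0) + 1e-5)
h1 = relu(linear(h0, 5))    # fc1 + ReLU -> (10, 5)
h2 = relu(linear(h1, 3))    # fc2 + ReLU -> (10, 3)
out = linear(h2, 1)         # fc3        -> (10, 1)

print(h1.shape, h2.shape, out.shape)
```

The final layer has no activation, which suits regression: the output is an unbounded real number per row.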

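With the class (the model) defined, instantiate it.
In addition, wrap it in L.Classifier, which comes with built-in reporting of the loss and other metrics during training. Parameters are initialized randomly when the model is created, so set the seed before instantiating it (any number works; 0 is used here).

Classification uses the cross-entropy loss, whereas regression uses the mean squared error.
Hovering over nn in L.Classifier(nn) and pressing shift + tab in Jupyter shows:
L.Classifier(
predictor,
lossfun=softmax_cross_entropy,
accfun=accuracy,
label_key=-1,
)
lossfun is the argument for the loss function. Its default is softmax_cross_entropy (cross-entropy loss), so it has to be replaced with mean_squared_error.
The model also has a compute_accuracy attribute. In classification, accuracy measures how many predictions were correct (for example, 10 out of 100). Regression instead measures error, that is, how far the predicted values are from the actual ones, so accuracy has no meaning there and compute_accuracy is set to False.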

.py

np.random.seed(0)
nn = NN()
model = L.Classifier(nn,lossfun = F.mean_squared_error)
model.compute_accuracy = False

⑤ Other settings

.py

# Optimizer setup
optimizer = chainer.optimizers.Adam()
optimizer.setup(model)

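# Iterator setup
# Batch sizes of roughly 10 to 100 are typical. The training split has 144 rows, so a batch size of 10 means roughly 14 parameter updates per epoch; the number of epochs is set further below.
batchsize = 10
train_iter = chainer.iterators.SerialIterator(train,batchsize)
# Validate on the held-out split (test), not on the training split
test_iter = chainer.iterators.SerialIterator(test,batchsize,repeat=False,shuffle=False)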

# Updater setup
from chainer import training
updater = training.StandardUpdater(train_iter,optimizer,device=-1)

# Trainer and extensions setup
from chainer.training import extensions
epoch = 1500
trainer = training.Trainer(updater,(epoch,'epoch'),out='result/obentou')
trainer.extend(extensions.Evaluator(test_iter,model,device=-1))
trainer.extend(extensions.LogReport(trigger=(1,'epoch')))
# Print the loss on the training data, the loss on the validation data, and the elapsed time
trainer.extend(extensions.PrintReport(['epoch','main/loss','validation/main/loss','elapsed_time']),trigger=(1,'epoch'))

⑥ Running the training

.py

trainer.run()

⑦ Visualizing the training results

.py

import json
with open("result/obentou/log") as f:
    logs = json.load(f)
    results = pd.DataFrame(logs)

results[["main/loss","validation/main/loss"]].plot()

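The loss is a mean squared error, so take the square root to return to the original scale.

.py

import math
# Assumption: loss is the validation loss of the last epoch, read from the log loaded above
loss = results["validation/main/loss"].iloc[-1]
math.sqrt(loss)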
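
A worked example of the relation between the two quantities, on three hypothetical predictions, may make the scale change clearer. The numbers are made up for illustration.

```python
import math
import numpy as np

pred = np.array([[100.0], [80.0], [60.0]], dtype='f')     # hypothetical predictions
target = np.array([[90.0], [85.0], [70.0]], dtype='f')    # hypothetical actual sales

# Errors are 10, -5, -10, so the squared errors are 100, 25, 100
mse = float(np.mean((pred - target) ** 2))   # (100 + 25 + 100) / 3 = 75.0
rmse = math.sqrt(mse)                        # about 8.66

print(mse, rmse)
```

The mean squared error is in squared units (lunch boxes squared), while its square root is back in the same unit as the target, which makes it easier to interpret.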

⑧ Predicting the test data

A single row of tx cannot be passed to the model as is. Chainer expects input of shape (batch size, number of input variables) for model.predictor, so each row is reshaped as follows.

.py

tx[0].shape
# (11,)
tx[0][np.newaxis].shape
# (1, 11) <- this is the shape that is required

Based on this, the prediction is written as follows.

.py

result = []
for i in range(len(tx)):
    x0 = tx[i].reshape(1,len(tx[i]))
    with chainer.using_config('train',False),chainer.using_config('enable_backprop',False):
        # Run inference
        y0_predict = model.predictor(x0)
        # Convert from Chainer's Variable to a numpy array
        y1_predict = y0_predict.array
        # The result looks like array([[value]], dtype=float32), so take the value out with [0][0]
        result.append(y1_predict[0][0])
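
Since the whole test array already has the shape (batch size, number of input variables), predicting all rows in one call is another option. The NumPy sketch below uses a hypothetical stand-in for model.predictor (a single linear layer with random weights) to show that row-by-row and all-at-once prediction give the same numbers for such a function. Whether the real Chainer model behaves identically in both modes has not been checked here; it would at least require test mode, so that batch normalization uses its stored statistics.

```python
import numpy as np

rng = np.random.RandomState(0)
w = rng.normal(size=(11, 1)).astype('f')          # hypothetical weights of one linear layer
tx_toy = rng.normal(size=(40, 11)).astype('f')    # stand-in for tx: 40 rows, 11 features

def predictor(batch):
    """Stand-in for model.predictor: expects shape (batch size, 11)."""
    return batch @ w

# Row by row, adding the batch axis each time as in the loop above
one_by_one = [predictor(tx_toy[i][np.newaxis])[0][0] for i in range(len(tx_toy))]

# All rows at once: tx_toy already has shape (batch size, number of features)
all_at_once = predictor(tx_toy)[:, 0]

print(np.allclose(one_by_one, all_at_once, atol=1e-4))
```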

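After uploading the file to SIGNATE, the score was 13.2649969.

(5) Summary

Regression with Scikit-learn scored 13.805, while regression with Chainer scored 13.264, so Chainer did slightly better.
Neither score is very good, though. There are probably more effective approaches, which I intend to explore as I keep studying.

(6) Notes on how regression differs from classification in Chainer

1) When converting from pandas to numpy, the dtypes of x (trainX) and y must match.

(Excerpt from above) Use the .values attribute to convert a pandas DataFrame to a numpy ndarray. Chainer works with 32-bit values by default, so the 64-bit data must be converted: astype('f') turns 64-bit floats into float32, and astype('i') turns 64-bit ints into int32. (Note: for regression, the input and output variables must have the same dtype or trainer.run raises an error, so both x and t are converted with astype('f'). For classification, differing dtypes do not cause an error.)

2) After converting from pandas to numpy, the target must be explicitly made two-dimensional.

(Excerpt from above) The shape (207,) is fine for classification, but for regression trainer.run raises an error unless the shape is (207, 1), that is, an explicit 207 rows by 1 column, so reshape t as follows.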

.py

t = t.reshape(len(t),1)
t.shape
# (207, 1)

3) Classification uses the cross-entropy loss, whereas regression uses the mean squared error. compute_accuracy must also be set to False.

(Excerpt from above)

.py

np.random.seed(0)
nn = NN()
model = L.Classifier(nn,lossfun = F.mean_squared_error)
model.compute_accuracy = False
