Overview
In the article introduced last time, discharge was predicted from rainfall using a tank model; here the same task is attempted with machine learning. The machine-learning library used is scikit-learn.
What is being done is essentially multiple regression, and the methods used are the following (only the import statements are shown).
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
Input data
When predicting discharge from rainfall, the inputs need some care. Experience makes it clear that discharge depends not only on the rainfall of the current day but also on rain that fell one day before, two days before, and so on. The following two cases of input were therefore considered.
Case-1
rf0 | rainfall on the current day |
rf1 | rainfall one day before |
rf2 | rainfall two days before |
... | |
rf10 | rainfall ten days before |
Layout of the input data file (Case-1)
date, rf0, rf1, rf2, rf3, rf4, rf5, rf6, rf7, rf8, rf9, rf10, Qo
2000/03/01, 0.2
2000/03/02, 0 , 0.2
2000/03/03, 1.5, 0 , 0.2
2000/03/04, 0 , 1.5, 0 , 0.2
2000/03/05, 0 , 0 , 1.5, 0 , 0.2
2000/03/06, 13.6, 0 , 0 , 1.5, 0 , 0.2
2000/03/07, 4.4, 13.6, 0 , 0 , 1.5, 0 , 0.2
2000/03/08, 0.2, 4.4, 13.6, 0 , 0 , 1.5, 0 , 0.2
2000/03/09, 1.7, 0.2, 4.4, 13.6, 0 , 0 , 1.5, 0 , 0.2
2000/03/10, 10.9, 1.7, 0.2, 4.4, 13.6, 0 , 0 , 1.5, 0 , 0.2
2000/03/11, 0 , 10.9, 1.7, 0.2, 4.4, 13.6, 0 , 0 , 1.5, 0 , 0.2, q1
2000/03/12, 0.9, 0 , 10.9, 1.7, 0.2, 4.4, 13.6, 0 , 0 , 1.5, 0 , q2
2000/03/13, 0.4, 0.9, 0 , 10.9, 1.7, 0.2, 4.4, 13.6, 0 , 0 , 1.5, q3
... , ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ...
Case-2
rrf0 | rainfall on the current day |
rrf1 | sum of rainfall on the current day and the day before |
rrf2 | sum of rainfall on the current day and the two days before |
... | |
rrf10 | sum of rainfall from the current day back through ten days before |
Layout of the input data file (Case-2)
date, rrf0, rrf1, rrf2, rrf3, rrf4, rrf5, rrf6, rrf7, rrf8, rrf9, rrf10, Qo
2000/03/01, 0.2
2000/03/02, 0 , 0.2
2000/03/03, 1.5, 1.5, 1.7
2000/03/04, 0 , 1.5, 1.5, 1.7
2000/03/05, 0 , 0 , 1.5, 1.5, 1.7
2000/03/06, 13.6, 13.6, 13.6, 15.1, 15.1, 15.3
2000/03/07, 4.4, 18 , 18 , 18 , 19.5, 19.5, 19.7
2000/03/08, 0.2, 4.6, 18.2, 18.2, 18.2, 19.7, 19.7, 19.9
2000/03/09, 1.7, 1.9, 6.3, 19.9, 19.9, 19.9, 21.4, 21.4, 21.6
2000/03/10, 10.9, 12.6, 12.8, 17.2, 30.8, 30.8, 30.8, 32.3, 32.3, 32.5
2000/03/11, 0 , 10.9, 12.6, 12.8, 17.2, 30.8, 30.8, 30.8, 32.3, 32.3, 32.5, q1
2000/03/12, 0.9, 0.9, 11.8, 13.5, 13.7, 18.1, 31.7, 31.7, 31.7, 33.2, 33.2, q2
2000/03/13, 0.4, 1.3, 1.3, 12.2, 13.9, 14.1, 18.5, 32.1, 32.1, 32.1, 33.6, q3
... , ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ..., ...
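Both feature sets can be generated from a single daily rainfall series with pandas. Below is a minimal sketch; the column name 'rain' and the synthetic series are placeholders, not the actual data.

import numpy as np
import pandas as pd

# synthetic daily rainfall series as a stand-in for the observed record
idx = pd.date_range('2000-03-01', '2017-12-31', freq='D')
df = pd.DataFrame({'rain': np.random.gamma(0.5, 5.0, len(idx))}, index=idx)

for k in range(11):
    df['rf{}'.format(k)] = df['rain'].shift(k)              # Case-1: rainfall k days before
    df['rrf{}'.format(k)] = df['rain'].rolling(k+1).sum()   # Case-2: sum of the last k+1 days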
Which is better?
With the input data used this time, Case-2 showed the better correlation, so the Case-2 inputs are used from here on.
Note that how many days of past rainfall to include depends on the conditions of the site, so it will likely need to be changed case by case.
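For reference, such a comparison can be made by checking the correlation of each candidate input with the observed discharge Qo; a minimal sketch, assuming the input file layout shown above:

import pandas as pd

# data_reg_1.csv: the Case-2 input file used by the program at the end of this post
df = pd.read_csv('data_reg_1.csv', header=0, index_col=0)
# correlation of every input column with the observed discharge Qo
print(df.corr()['Qo'].drop('Qo').round(3))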
Sample output
- Of the methods above, Ridge, Lasso, SVR, and MLP include parameter optimization by grid search. The parameters of the other methods are fixed.
- There are 17 years of data; training uses an amount equivalent to the first three years (train_size=0.175), and the remaining 14 years' worth serves as test data (see the sketch after this list).
- Although this is time-series data, the time-series effect is assumed to be captured in the input features, so the training data are shuffled randomly.
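The split in the second item corresponds to the call below; the dummy arrays stand in for the real XX and yy (6209 rows of daily data):

import numpy as np
from sklearn.model_selection import train_test_split

# dummy stand-ins for the real feature matrix XX and target yy
XX = np.random.rand(6209, 11)   # 17 years of daily data, 11 input features
yy = np.random.rand(6209, 1)

# random 17.5% training split with a fixed seed, as in the program below
X_train, X_test, y_train, y_test = train_test_split(
    XX, yy, train_size=0.175, random_state=0)
print(X_train.shape, X_test.shape)   # (1086, 11) (5123, 11)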
Preprocessing with MinMaxScaler only
With scaling alone, good predictions are obtained from SVR with the rbf kernel and from the neural network (MLP), both of which can follow nonlinearity.
X_train.shape: (1086, 11)
y_train.shape: (1086, 1)
X_test.shape: (5123, 11)
y_test.shape: (5123, 1)
* LinearRegression
Training set score : 0.819
Test set score : 0.824
* Ridge
Best parameters: {'alpha': 0.1}
Best cross-validation score: 0.809
Test set score with best parameters: 0.824
* Lasso
Best parameters: {'alpha': 0.01, 'max_iter': 100000}
Best cross-validation score: 0.810
Test set score with best parameters: 0.824
* SVR(linear)
Best parameters: {'C': 10000.0, 'kernel': 'linear'}
Best cross-validation score: 0.787
Test set score with best parameters: 0.786
* SVR(rbf)
Best parameters: {'C': 10000.0, 'epsilon': 1.0, 'gamma': 0.1, 'kernel': 'rbf'}
Best cross-validation score: 0.891
Test set score with best parameters: 0.912
* MLP
Best parameters: {'activation': 'relu', 'alpha': 10, 'hidden_layer_sizes': (10, 10), 'solver': 'lbfgs'}
Best cross-validation score: 0.895
Test set score with best parameters: 0.905
* RandomForest
Training set score : 0.978
Test set score : 0.849
* GradientBoosting
Training set score : 0.963
Test set score : 0.850
* KNeighbors
Training set score : 0.898
Test set score : 0.835
Preprocessing with MinMaxScaler plus degree-2 polynomial features
Adding degree-2 polynomial features improves the prediction accuracy even for linear regression, compared with the case without them. No improvement is seen for the tree-based methods or for k-nearest neighbors.
It can also be seen that the predictions of the tree-based methods and k-nearest neighbors hit a ceiling. The training data are the first three years (2001, 2002, and 2003), while the test data contain years such as 2008, 2010, 2012, 2016, and 2017 whose annual rainfall is clearly larger than in those first three years. Those years hold many rainfall values never experienced in the training data, which suggests that the tree-based methods and k-nearest neighbors cannot handle extrapolation.
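That these methods cannot extrapolate is easy to reproduce on synthetic data. The following minimal sketch (not part of the original analysis) fits y = 2x on x in [0, 1] and then predicts at x = 2, outside the training range:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
y = 2.0 * X.ravel()   # y = 2x on the training range [0, 1]

for reg in [RandomForestRegressor(random_state=0),
            KNeighborsRegressor(n_neighbors=5),
            LinearRegression()]:
    reg.fit(X, y)
    # trees and k-NN return values near the largest training target (~2.0),
    # while linear regression extrapolates to ~4.0
    print(type(reg).__name__, reg.predict([[2.0]]))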
X_train.shape: (1086, 78)
y_train.shape: (1086, 1)
X_test.shape: (5123, 78)
y_test.shape: (5123, 1)
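For reference, 78 columns is the size of a degree-2 polynomial expansion of the 11 inputs when the bias column is kept (1 bias + 11 linear + 66 products and squares); with include_bias=False, as in the listing at the end of this post, the count would be 77:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(5, 11)   # 11 original rainfall features
print(PolynomialFeatures(degree=2).fit_transform(X).shape)                      # (5, 78)
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(X).shape)  # (5, 77)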
* LinearRegression
Training set score : 0.914
Test set score : 0.888
* Ridge
Best parameters: {'alpha': 0.1}
Best cross-validation score: 0.892
Test set score with best parameters: 0.905
* Lasso
Best parameters: {'alpha': 0.01, 'max_iter': 100000}
Best cross-validation score: 0.892
Test set score with best parameters: 0.905
* SVR(linear)
Best parameters: {'C': 1000.0, 'kernel': 'linear'}
Best cross-validation score: 0.892
Test set score with best parameters: 0.905
* SVR(rbf)
Best parameters: {'C': 10000.0, 'epsilon': 1.0, 'gamma': 0.01, 'kernel': 'rbf'}
Best cross-validation score: 0.885
Test set score with best parameters: 0.910
* MLP
Best parameters: {'activation': 'relu', 'alpha': 100, 'hidden_layer_sizes': (10, 10), 'solver': 'lbfgs'}
Best cross-validation score: 0.901
Test set score with best parameters: 0.911
* RandomForest
Training set score : 0.977
Test set score : 0.852
* GradientBoosting
Training set score : 0.968
Test set score : 0.858
* KNeighbors
Training set score : 0.892
Test set score : 0.827
Discharge prediction results by Lasso regression (with polynomial features added)
Discharge prediction program using multiple regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# global variables
fsz=12
xmin=0; ymin=0
xmax=600; ymax=600
bw=3; icol=3; irow=3
def drawfig(y_test, y_pred, ss, nplt):
#=======================
# i=1 i=2 i=3
# i=4 i=5 i=6
# i=7 i=8 i=9
#=======================
plt.subplot(irow, icol, nplt)
plt.gca().set_aspect('equal',adjustable='box')
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)
if 7<=nplt:
plt.xlabel('y_test')
else:
plt.xticks([])
if nplt==1 or nplt==4 or nplt==7:
plt.ylabel('y_predicted')
else:
plt.yticks([])
plt.plot(y_test, y_pred, 'o')
plt.plot([xmin,xmax],[ymin,ymax],'-')
xs=xmin+0.05*(xmax-xmin)
ys=ymin+0.95*(ymax-ymin)
plt.text(xs, ys, ss, va='top', ha='left', fontsize=fsz)
def lin(X_train, X_test, y_train, y_test, XXd):
reg=LinearRegression()
reg.fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
print('* LinearRegression')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
ss='LinearReg\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=1; drawfig(y_test, y_pred, ss, nplt)
yp_all=reg.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def rid(X_train, X_test, y_train, y_test, XXd):
alp = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
param_grid={'alpha': alp}
grid=GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
y_pred=grid.predict(X_test)
print('* Ridge')
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
ss = 'Ridge\n'
ss = ss +'R$^2$={0:.3f}'.format(test_score)
nplt=2; drawfig(y_test, y_pred, ss, nplt)
yp_all=grid.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def las(X_train, X_test, y_train, y_test, XXd):
alp = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
param_grid={'alpha': alp, 'max_iter': [100000]}
grid=GridSearchCV(Lasso(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
y_pred=grid.predict(X_test)
print('* Lasso')
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
    ss = 'Lasso\n'
ss = ss +'R$^2$={0:.3f}'.format(test_score)
nplt=3; drawfig(y_test, y_pred, ss, nplt)
yp_all=grid.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def sv1(X_train, X_test, y_train, y_test, XXd):
cc=[1e0, 1e1, 1e2, 1e3, 1e4]
param_grid={'kernel': ['linear'], 'C': cc}
grid=GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
y_pred=grid.predict(X_test)
print('* SVR(linear)')
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
ss='SVR(linear)\n'+'R$^2$={0:.3f}'.format(test_score)
nplt=4; drawfig(y_test, y_pred, ss, nplt)
yp_all=grid.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def sv2(X_train, X_test, y_train, y_test, XXd):
cc=[1e0, 1e1, 1e2, 1e3, 1e4]
gg=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
param_grid={'kernel': ['rbf'], 'C': cc, 'gamma': gg, 'epsilon': [1.0]}
grid=GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
y_pred=grid.predict(X_test)
print('* SVR(rbf)')
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
ss='SVR(rbf)\n'+'R$^2$={0:.3f}'.format(test_score)
nplt=5; drawfig(y_test, y_pred, ss, nplt)
yp_all=grid.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def mlp(X_train, X_test, y_train, y_test, XXd):
alp=[1e-4, 1e-3, 1e-2, 1e-1, 1e0, 10, 100]
param_grid={'solver': ['lbfgs'], 'activation': ['relu'], 'hidden_layer_sizes': [(100,),(10,10)], 'alpha': alp}
grid=GridSearchCV(MLPRegressor(), param_grid, cv=5)
grid.fit(X_train, y_train)
test_score=grid.score(X_test, y_test)
y_pred=grid.predict(X_test)
print('* MLP')
print('Best parameters: {}'.format(grid.best_params_))
print('Best cross-validation score: {0:.3f}'.format(grid.best_score_))
print('Test set score with best parameters: {0:.3f}'.format(test_score))
ss='MLP\n'+'R$^2$={0:.3f}'.format(test_score)
nplt=6; drawfig(y_test, y_pred, ss, nplt)
    yp_all=grid.predict(XXd)
    yp_all = np.reshape(yp_all, (-1, ))   # flatten for consistency with the other methods
    return yp_all
def rfr(X_train, X_test, y_train, y_test, XXd):
reg=RandomForestRegressor(n_estimators=1000)
reg.fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
print('* RandomForest')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
ss='RandomForest\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=7; drawfig(y_test, y_pred, ss, nplt)
yp_all=reg.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def gbr(X_train, X_test, y_train, y_test, XXd):
reg=GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
reg.fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
print('* GradientBoosting')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
ss='GradientBoosting\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=8; drawfig(y_test, y_pred, ss, nplt)
yp_all=reg.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def knb(X_train, X_test, y_train, y_test, XXd):
reg=KNeighborsRegressor(n_neighbors=5)
reg.fit(X_train, y_train)
sc_tra=reg.score(X_train, y_train)
sc_tes=reg.score(X_test, y_test)
y_pred=reg.predict(X_test)
print('* KNeighbors')
print('{0:25s}: {1:.3f}'.format('Training set score',sc_tra))
print('{0:25s}: {1:.3f}'.format('Test set score',sc_tes))
ss='KNeighbors\n'+'R$^2$={0:.3f}'.format(sc_tes)
nplt=9; drawfig(y_test, y_pred, ss, nplt)
yp_all=reg.predict(XXd)
yp_all = np.reshape(yp_all, (-1, ))
return yp_all
def inp_rf():
# input rainfall data
#fnameR='data_reg.csv'
fnameR='data_reg_1.csv'
df = pd.read_csv(fnameR, header=0, index_col=0)
df.index = pd.to_datetime(df.index, format='%Y-%m-%d')
df=df[df.index >= '2001-01-01'] # data after 2001-01-01 is available
df=df[df.index <= '2017-12-31'] # data before 2017-12-31 is available
return df
def main():
df0 = inp_rf() # input data
#XX=df0.loc[:,['rf0','rf1','rf2','rf3','rf4','rf5','rf6','rf7','rf8','rf9','rf10',
# 'rf11','rf12','rf13','rf14','rf15','rf16','rf17','rf18','rf19','rf20']]
#XX=df0.loc[:,['rf0','rf1','rf2','rf3','rf4','rf5','rf6','rf7','rf8','rf9','rf10']]
XX=df0.loc[:,['rrf0','rrf1','rrf2','rrf3','rrf4','rrf5','rrf6','rrf7','rrf8','rrf9','rrf10']]
yy=df0.loc[:,['Qo']]
for iii in [2,3]:
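        # iii=2: MinMaxScaler only; iii=3: MinMaxScaler + degree-2 polynomial features
        # (iii=1, no preprocessing, is defined but skipped in this run)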
X_train, X_test, y_train, y_test = train_test_split(
XX, yy, train_size=0.175, random_state=0)
#X_train, X_test, y_train, y_test = train_test_split(
# XX, yy, train_size=0.176, shuffle=False, stratify=None)
if iii==1:
pass
if iii==2:
pipe=Pipeline([
('scaler', MinMaxScaler())
])
if iii==3:
pipe=Pipeline([
('scaler', MinMaxScaler()),
('poly', PolynomialFeatures(degree=2,include_bias=False))
])
pipe.fit(X_train, y_train)
X_train=pipe.transform(X_train)
X_test=pipe.transform(X_test)
XXd=pipe.transform(XX)
print('X_train.shape: {}'.format(X_train.shape))
print('y_train.shape: {}'.format(y_train.shape))
print('X_test.shape: {}'.format(X_test.shape))
print('y_test.shape: {}'.format(y_test.shape))
plt.figure(figsize=(bw*icol,bw*irow), facecolor='w')
plt.subplots_adjust(wspace=0.1, hspace=0.1)
yp1 = lin(X_train, X_test, y_train, y_test, XXd)
yp2 = rid(X_train, X_test, y_train, y_test, XXd)
yp3 = las(X_train, X_test, y_train, y_test, XXd)
yp4 = sv1(X_train, X_test, y_train, y_test, XXd)
yp5 = sv2(X_train, X_test, y_train, y_test, XXd)
yp6 = mlp(X_train, X_test, y_train, y_test, XXd)
yp7 = rfr(X_train, X_test, y_train, y_test, XXd)
yp8 = gbr(X_train, X_test, y_train, y_test, XXd)
yp9 = knb(X_train, X_test, y_train, y_test, XXd)
print(yp1.shape)
print(yp2.shape)
print(yp3.shape)
print(yp4.shape)
print(yp5.shape)
print(yp6.shape)
print(yp7.shape)
print(yp8.shape)
print(yp9.shape)
#plt.tight_layout()
fnameF='fig_reg_rf_10-'+str(iii)+'.jpg'
plt.savefig(fnameF, dpi=100, bbox_inches="tight", pad_inches=0.1)
plt.show()
dfx = pd.DataFrame({'date' : df0.index.values,
'Rf' : df0['rf0'],
'Qo' : df0['Qo'],
'lin' : yp1,
'rid' : yp2,
'las' : yp3,
'sv1' : yp4,
'sv2' : yp5,
'mlp' : yp6,
'rfr' : yp7,
'gbr' : yp8,
'knb' : yp9
})
dfx = dfx.set_index('date')
if iii==2: dfx.to_csv('df2.csv')
if iii==3: dfx.to_csv('df3.csv')
#==============
# Execution
#==============
if __name__ == '__main__': main()
The program for plotting the discharge prediction results is much the same as the above, so it is omitted.
That is all.