
Yet another: Predicting stock prices with machine learning (2)

Last updated at 2018-08-29, posted at 2018-08-26

1. Goal of this article

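In the previous article (Yet another: Predicting stock prices with machine learning (1)), I wrote a program that applies machine learning to historical Nikkei 225 data to judge whether today's closing price will end up above or below the opening price. However, the results varied from run to run, so it was hard to tell whether the approach is actually usable. This time I run cross-validation to evaluate it more rigorously. I also gather the program's parameters in one place so that experiments with different settings are easier to run.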

2. Cross-validation

Here I use KFold from sklearn.model_selection for cross-validation. Note that Reference 1 uses

dummy01.py

from sklearn.cross_validation import KFold

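but the sklearn.cross_validation module is deprecated and was removed in scikit-learn 0.20, so it should not be used in new code.
The concrete usage is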

dummy02.py

from sklearn.model_selection import KFold

kf = KFold(n_splits=n_splits)

for train, test in kf.split(df):
    X_train, r_train = df[columns_input].iloc[train], df["Result"].iloc[train]
    X_test,  r_test  = df[columns_input].iloc[test], df["Result"].iloc[test]
    X_train, X_test = X_train.values, X_test.values

In this way, KFold generates the row indices that split df into n_splits folds.
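To make the shape of those indices concrete, here is a minimal sketch on a made-up 10-row frame (the values are placeholders, not the N225 data). It only assumes the default KFold settings, where shuffle is False.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame standing in for the price data (values are made up).
df = pd.DataFrame({
    "Open": np.linspace(0.99, 1.01, 10),
    "Result": np.linspace(1.01, 0.99, 10),
})

# shuffle defaults to False, so each test fold is a contiguous block of rows.
kf = KFold(n_splits=5)

# kf.split yields (train_indices, test_indices) as integer position arrays.
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(df)]
for train, test in folds:
    print("train:", train, "test:", test)
```

Because the indices are positions rather than labels, they are used with .iloc. With date-ordered data and no shuffling, each test fold is a contiguous block of dates, and most training sets also contain rows that come after the test block.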

3. Gathering the parameter settings

I also gather the adjustable parameters so that they can be set near the top of the program. The parameters that can be set this time are

dummy03.py

    ###########################################################
    # Set the parameters in the block below
    # Use data from this date onward
    start = '2008-01-01'

    # Use data up to this date
    end = '2017-12-31'

    # Threshold for judging a rise or a fall
    threshold = 0.001

    # n_estimators for RandomForest
    n_estimators = 5

    # Number of KFold splits
    n_splits = 10
    ##########################################################

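as listed above. Of these, threshold, which sets the cutoff for counting a move as a rise or a fall, and the n_estimators of RandomForest are the parameters most likely to affect prediction accuracy. The behaviour may also differ depending on the period the data covers, so start and end let me change which period is used.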
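As a small illustration of what threshold does, the sketch below applies the same labelling rule as the full source at the end of this article to a handful of made-up close/open ratios. The helper name label and the sample ratios are assumptions for illustration only.

```python
def label(ratios, threshold, polarity):
    """Return 1 where the close/open ratio moves by at least `threshold`
    in the given direction, else 0 (same rule as the full source)."""
    if polarity == "positive":
        return [1 if r - 1.0 >= threshold else 0 for r in ratios]
    return [1 if 1.0 - r >= threshold else 0 for r in ratios]


# Made-up close/open ratios: 1.003 means the close was 0.3 % above the open.
ratios = [1.003, 1.0015, 1.0005, 0.9995, 0.9985, 0.997]

pos_01 = label(ratios, 0.001, "positive")
neg_01 = label(ratios, 0.001, "negative")
pos_02 = label(ratios, 0.002, "positive")
neg_02 = label(ratios, 0.002, "negative")

print("threshold 0.1 %:", pos_01, neg_01)
print("threshold 0.2 %:", pos_02, neg_02)
```

Days whose move is smaller than the threshold get label 0 for both polarities, so raising the threshold shrinks the share of 1 labels in both classifiers.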
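In addition, for the test accuracy and the mean gain I now compute the standard deviation across the cross-validation folds as well as the mean, and print them in the form

    mean ± standard deviation

Note that this "±" is only a rough guide: it does not mean that all (or even most) of the trial results fall within this range.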
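The sketch below shows the formatting on made-up per-fold scores. It uses the standard library's statistics.pstdev, which is the population standard deviation and matches the default of NumPy's std (ddof=0) used in the full source. The helper name mean_pm_std and the scores are assumptions for illustration.

```python
import statistics


def mean_pm_std(values):
    """Format per-fold values as 'mean±std' using the population std."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)  # population std, like numpy's default ddof=0
    return "{:.3f}±{:.3f}".format(m, s)


# Made-up per-fold test accuracies (not results from the N225 data).
scores = [0.48, 0.52, 0.50, 0.54, 0.46]

m = statistics.mean(scores)
s = statistics.pstdev(scores)
# Count how many folds actually fall inside mean ± std.
inside = [v for v in scores if abs(v - m) <= s]

print(mean_pm_std(scores))
print("{} of {} folds lie within one standard deviation".format(len(inside), len(scores)))
```

In this toy case only some of the folds land inside the printed range, which is the caveat above in miniature.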

4. Results

Here is one example run. The mean gain is scattered widely around 0, so this does not look much different from deciding by rolling a die.

Start date: 2008-01-01
End date: 2017-12-31
 threshold: 0.10 %
 n_estimators: 5
 n_splits: 10
   positive training accuracy: 0.941
   positive test accuracy: 0.510±0.029
   positive mean gain: -0.013±0.036 %
   negative training accuracy: 0.942
   negative test accuracy: 0.505±0.026
   negative mean gain: 0.004±0.027 %

Raising the threshold to 0.2% gives

Start date: 2008-01-01
End date: 2017-12-31
 threshold: 0.20 %
 n_estimators: 5
 n_splits: 10
   positive training accuracy: 0.947
   positive test accuracy: 0.535±0.038
   positive mean gain: -0.022±0.033 %
   negative training accuracy: 0.944
   negative test accuracy: 0.526±0.041
   negative mean gain: -0.025±0.031 %

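and the results clearly get worse: the mean gain is now negative for both the positive and the negative predictions, even though the test accuracy is slightly higher. As it stands there is no point in using machine learning, so next I will try to improve the predictions by using indicators other than the Nikkei 225 as well.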
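As a rough back-of-the-envelope check on the two runs above, the fold-to-fold standard deviation can be turned into a standard error of the mean. This is only a sketch under stated assumptions: the printed standard deviation is a population value (NumPy's default), so the standard error is std / sqrt(n - 1), and the folds are treated as independent, which is not strictly true for cross-validation on a time series. The helper name standard_error is hypothetical.

```python
import math


def standard_error(pop_std, n_folds):
    """Standard error of the mean, given a population std (ddof=0) over n_folds values."""
    return pop_std / math.sqrt(n_folds - 1)


# (mean gain %, std %) copied from the two runs above; n_splits = 10 in both.
runs = {
    "threshold 0.10 %, positive": (-0.013, 0.036),
    "threshold 0.10 %, negative": (0.004, 0.027),
    "threshold 0.20 %, positive": (-0.022, 0.033),
    "threshold 0.20 %, negative": (-0.025, 0.031),
}

ratios = {}
for name, (mean, std) in runs.items():
    se = standard_error(std, 10)
    ratios[name] = mean / se  # mean gain measured in units of its standard error
    print("{}: mean/SE = {:+.2f}".format(name, ratios[name]))
```

Only one of the four mean gains is positive, and it is less than half a standard error above zero, which is consistent with the reading that the classifier has no usable edge here.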

References

1. 「実践 機械学習システム」 (Japanese edition of Building Machine Learning Systems with Python), W. Richert and L. P. Coelho, O'Reilly Japan

Source

qiita02.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright by troilus (2018)

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold


if __name__ == '__main__':
    ###########################################################
    # Set the parameters in the block below
    # Use data from this date onward
    start = '2008-01-01'

    # Use data up to this date
    end = '2017-12-31'

    # Threshold for judging a rise or a fall
    threshold = 0.001

    # n_estimators for RandomForest
    n_estimators = 5

    # Number of KFold splits
    n_splits = 10
    ##########################################################

    print("Start date: {}".format(start))
    print("End date: {}".format(end))
    print(" threshold: {:.2f} %".format(threshold * 100.0))
    print(" n_estimators: {}".format(n_estimators))
    print(" n_splits: {}".format(n_splits))

    df = pd.read_csv("N225.csv", na_values=["null"])
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.set_index("Date")
    df = df[["Open", "High", "Low", "Close"]]

    df = df.dropna()
    
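    # Normalize each day's prices by that day's close
    df["Open"] /= df["Close"]
    df["High"] /= df["Close"]
    df["Low"] /= df["Close"]
    df["Close"] = 1.0
    # Target: the next day's close/open ratio (> 1 means it closed above its open)
    df["Result"] = 1/df["Open"].shift(-1)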


    df = df[1:]
    df = df[start:end]

    columns_input = ["Open", "High", "Low", "Close"]
    acc_train = {'positive': [], 'negative': []}
    acc_test  = {'positive': [], 'negative': []}
    gain      = {'positive': [], 'negative': [], 'total': {}}


    kf = KFold(n_splits=n_splits)

    for train, test in kf.split(df):
        X_train, r_train = df[columns_input].iloc[train], df["Result"].iloc[train]
        X_test,  r_test  = df[columns_input].iloc[test], df["Result"].iloc[test]
        X_train, X_test = X_train.values, X_test.values

        clf = RandomForestClassifier(n_estimators=n_estimators)
        for polarity in ["positive", "negative"]:
            if polarity == "positive":
                y_train = [1 if r - 1.0 >= threshold else 0 for r in r_train]
                y_test = [1 if r - 1.0 >= threshold else 0 for r in r_test]
            else:
                y_train = [1 if 1.0 - r >= threshold else 0 for r in r_train]
                y_test = [1 if 1.0 - r >= threshold else 0 for r in r_test]
            
            clf.fit(X_train, y_train)
            acc_train[polarity].append(clf.score(X_train, y_train))
            acc_test[polarity].append(clf.score(X_test, y_test))
            if polarity == "positive":
                gain[polarity].append((clf.predict(X_test) * (r_test - 1)).mean())
            else:
                gain[polarity].append((clf.predict(X_test) * (1 - r_test)).mean())

    for polarity in ["positive", "negative"]:
        print('   {} training accuracy: {:.3f}'.format(polarity, np.array(acc_train[polarity]).mean()))
        print('   {} test accuracy: {:.3f}±{:.3f}'.format(polarity, 
                                                           np.array(acc_test[polarity]).mean(),
                                                           np.array(acc_test[polarity]).std()))
        print('   {} mean gain: {:.3f}±{:.3f} %'.format(polarity, 
                                                 np.array(gain[polarity]).mean()*100,
                                                 np.array(gain[polarity]).std()*100))
