Bars

Data analysis

Last updated at 2020-08-12. Posted at 2020-08-12.

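#1 About this article
This article records how to implement the content of section 2.3 "Bars" of Advances in Financial Machine Learning (ファイナンス機械学習).

#2 Contents

#2-1 Preparing the data

Download the E-mini S&P 500 data from the site below.
https://s3-us-west-2.amazonaws.com/tick-data-s3/downloads/ES_Sample.zip

・ES_Trades.csv : raw tick data for E-Mini S&P 500 Sep '13 (ESU13).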

sample.py

import mlfinlab as ml
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
%matplotlib inline

PATH='./ES_Sample/ES_Trades.csv'  # Tick (time and sales) data for E-Mini S&P 500 Sep '13 (ESU13).
data = pd.read_csv(PATH)
data.head()

(Output) The raw tick data for E-Mini S&P 500 Sep '13 (ESU13) has been loaded.

sample.py


Symbol	Date	Time	Price	Volume	Market Flag	Sales Condition	Exclude Record Flag	Unfiltered Price
0	ESU13	09/01/2013	17:00:00.083	1640.25	8	E	0	NaN	1640.25
1	ESU13	09/01/2013	17:00:00.083	1640.25	1	E	0	NaN	1640.25
2	ESU13	09/01/2013	17:00:00.083	1640.25	2	E	0	NaN	1640.25
3	ESU13	09/01/2013	17:00:00.083	1640.25	1	E	0	NaN	1640.25
4	ESU13	09/01/2013	17:00:00.083	1640.25	1	E	0	NaN	1640.25

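#2-2 Reducing the data

ES_Trades.csv (E-Mini S&P 500 Sep '13, ESU13) is too large, so only the required columns (Date, Time, Price, and Volume, combined into date, price, and volume) are extracted to reduce the data size.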

sample.py

# Format the Data
date_time = data['Date'] + ' ' + data['Time']  # Don't convert to datetime here; the conversion is very slow on this many rows.
new_data = pd.concat([date_time, data['Price'], data['Volume']], axis=1)
new_data.columns = ['date', 'price', 'volume']
print(new_data.head())
print('\n')
print('Rows:', new_data.shape[0])

PATH1='./ES_Sample/raw_tick_data.csv'
# Save the reduced data as a CSV file
new_data.to_csv(PATH1, index=False)

Output

test.txt

                      date    price  volume
0  09/01/2013 17:00:00.083  1640.25       8
1  09/01/2013 17:00:00.083  1640.25       1
2  09/01/2013 17:00:00.083  1640.25       2
3  09/01/2013 17:00:00.083  1640.25       1
4  09/01/2013 17:00:00.083  1640.25       1


Rows: 5454950
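
As a possible alternative when memory is the concern, the unneeded columns can be skipped at read time with the `usecols` argument of `pd.read_csv`. The sketch below is an assumption, not the article's method: the in-memory `raw` buffer is a hypothetical stand-in for ES_Trades.csv (with only some of its columns), built from the rows shown in section 2-1.

```python
import io
import pandas as pd

# Hypothetical stand-in for './ES_Sample/ES_Trades.csv' (only some columns shown)
raw = io.StringIO(
    "Symbol,Date,Time,Price,Volume,Market Flag\n"
    "ESU13,09/01/2013,17:00:00.083,1640.25,8,E\n"
    "ESU13,09/01/2013,17:00:00.083,1640.25,1,E\n"
)

# Read only the columns that are needed later
data = pd.read_csv(raw, usecols=['Date', 'Time', 'Price', 'Volume'])

# Same layout as new_data in the article: date, price, volume
new_data = pd.DataFrame({
    'date': data['Date'] + ' ' + data['Time'],  # kept as a string, as in the article
    'price': data['Price'],
    'volume': data['Volume'],
})
print(new_data)
```

With the real file, `raw` would be replaced by the path used in section 2-1.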

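#2-3 Generating dollar bars, volume bars, and tick bars

get_dollar_bars : generates dollar bars from raw tick data.

get_dollar_bars(file_path_or_df, threshold, batch_size, verbose, to_csv, output_path)
・[file_path_or_df] Path to the CSV file containing the raw tick data (or a DataFrame, as the parameter name suggests).
・[threshold] Once the cumulative dollar value traded reaches threshold, the ticks accumulated so far are aggregated into one OHLCV bar.
・[batch_size] Number of rows read per batch (a smaller value uses less RAM).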
・[verbose] Print out batch numbers
・[to_csv] Save bars to csv after every batch run (bool:True or False)
・[output_path] Path to csv file, if to_csv is True

★Example: generating dollar bars from raw tick data (threshold=70000000)

sample.py

dollar = ml.data_structures.get_dollar_bars('./ES_Sample/raw_tick_data.csv', threshold=70000000, batch_size=1000000, verbose=False, to_csv=True,output_path='./ES_Sample/dollar_var.csv')

test.txt

                        tick_num     open    high      low    close  volume  \
date_time                                                                      
2013-09-01 21:34:39.298     11207  1640.25  1643.5  1639.00  1640.75   42862   
2013-09-02 02:56:24.209     26547  1640.75  1646.0  1640.25  1644.50   42585   
2013-09-02 06:37:33.128     40473  1644.50  1647.5  1644.25  1647.50   42580   
2013-09-02 09:34:46.141     51328  1647.50  1648.5  1645.25  1647.00   42535   
2013-09-02 22:55:20.297     64261  1647.00  1648.5  1645.25  1648.00   42512   

                         cum_buy_volume  cum_ticks  cum_dollar_value  
date_time                                                             
2013-09-01 21:34:39.298           21896      11207       70347610.00  
2013-09-02 02:56:24.209           24320      15340       70000546.50  
2013-09-02 06:37:33.128           23167      13926       70095794.25  
2013-09-02 09:34:46.141           23904      10855       70053015.75  
2013-09-02 22:55:20.297           23884      12933       70024910.50
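
To make the role of threshold concrete, the following is a minimal sketch of the idea behind dollar bars: price × volume is accumulated tick by tick, and a bar is closed as soon as the running total reaches threshold. This is not mlfinlab's implementation. `simple_dollar_bars` is a hypothetical helper, the toy ticks are made up for illustration, and columns such as tick_num and cum_buy_volume are omitted.

```python
import pandas as pd

def simple_dollar_bars(ticks, threshold):
    """Hypothetical helper: close a bar once cumulative price*volume reaches threshold."""
    bars = []
    bucket = []        # (date, price, volume) of the bar in progress
    cum_dollar = 0.0   # dollar value traded in the bar in progress
    for date, price, volume in ticks[['date', 'price', 'volume']].itertuples(index=False):
        bucket.append((date, price, volume))
        cum_dollar += price * volume
        if cum_dollar >= threshold:
            prices = [p for _, p, _ in bucket]
            bars.append({
                'date_time': date,            # timestamp of the tick that closed the bar
                'open': prices[0],
                'high': max(prices),
                'low': min(prices),
                'close': prices[-1],
                'volume': sum(v for _, _, v in bucket),
                'cum_dollar_value': cum_dollar,
            })
            bucket, cum_dollar = [], 0.0
    # Ticks left in an unfinished bar at the end are dropped in this sketch
    return pd.DataFrame(bars)

# Toy ticks, made up for illustration only
ticks = pd.DataFrame({
    'date': ['t1', 't2', 't3', 't4', 't5'],
    'price': [100.0, 101.0, 99.0, 102.0, 103.0],
    'volume': [5, 5, 10, 3, 20],
})
bars = simple_dollar_bars(ticks, threshold=1000)
print(bars)
```

Each bar's cum_dollar_value ends slightly above the threshold, which matches the pattern in the output above (values just over 70,000,000).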

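get_volume_bars : generates volume bars from raw tick data.

get_volume_bars(file_path_or_df, threshold, batch_size, verbose, to_csv, output_path)
・[file_path_or_df] Path to the CSV file containing the raw tick data (or a DataFrame, as the parameter name suggests).
・[threshold] Once the cumulative traded volume reaches threshold, the ticks accumulated so far are aggregated into one OHLCV bar.
・[batch_size] Number of rows read per batch (a smaller value uses less RAM).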
・[verbose] Print out batch numbers
・[to_csv] Save bars to csv after every batch run (bool:True or False)
・[output_path] Path to csv file, if to_csv is True(str)

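★Example: generating volume bars from raw tick data (threshold=28000)

sample.py

# Note: threshold=28000 is inferred from the volume column of the output below,
# and the output file name follows the pattern of the other examples.
volume = ml.data_structures.get_volume_bars('./ES_Sample/raw_tick_data.csv', threshold=28000, batch_size=1000000, verbose=False, to_csv=True, output_path='./ES_Sample/volume_var.csv')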

test.txt

                         tick_num     open     high      low    close  volume  \
date_time                                                                       
2013-09-01 19:32:23.387      7171  1640.25  1642.00  1639.00  1642.00   28031   
2013-09-02 01:18:21.928     16133  1642.00  1644.00  1640.25  1643.50   28003   
2013-09-02 02:50:32.992     25976  1643.50  1646.00  1642.25  1644.75   28000   
2013-09-02 04:57:09.236     35968  1644.75  1647.25  1643.75  1646.00   28000   
2013-09-02 07:04:32.076     43461  1646.00  1648.50  1645.75  1647.50   28013   

                         cum_buy_volume  cum_ticks  cum_dollar_value  
date_time                                                             
2013-09-01 19:32:23.387           15442       7171       45991914.75  
2013-09-02 01:18:21.928           14566       8962       45992828.50  
2013-09-02 02:50:32.992           15550       9843       46039902.25  
2013-09-02 04:57:09.236           14211       9992       46082594.25  
2013-09-02 07:04:32.076           17300       7493       46139513.25

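get_tick_bars : generates tick bars from raw tick data.

get_tick_bars(file_path_or_df, threshold, batch_size, verbose, to_csv, output_path)
・[file_path_or_df] Path to the CSV file containing the raw tick data (or a DataFrame, as the parameter name suggests).
・[threshold] Once the number of ticks reaches threshold, the ticks accumulated so far are aggregated into one OHLCV bar.
・[batch_size] Number of rows read per batch (a smaller value uses less RAM).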
・[verbose] Print out batch numbers
・[to_csv] Save bars to csv after every batch run (bool:True or False)
・[output_path] Path to csv file, if to_csv is True

★Example: generating tick bars from raw tick data (threshold=5500)

sample.py

tick = ml.data_structures.get_tick_bars('./ES_Sample/raw_tick_data.csv', threshold=5500, batch_size=1000000, verbose=False,to_csv=True,output_path='./ES_Sample/tick_var.csv')

test.txt

                         tick_num     open    high      low    close  volume  \
date_time                                                                      
2013-09-01 18:53:51.423      5500  1640.25  1642.0  1639.00  1640.25   23119   
2013-09-01 21:29:57.152     11000  1640.25  1643.5  1639.75  1641.50   18940   
2013-09-02 01:28:59.673     16500  1641.50  1644.0  1640.25  1643.50   15011   
2013-09-02 02:22:33.934     22000  1643.50  1644.5  1642.25  1644.25   15417   
2013-09-02 03:07:48.372     27500  1644.25  1646.0  1643.75  1645.50   15328   

                         cum_buy_volume  cum_ticks  cum_dollar_value  
date_time                                                             
2013-09-01 18:53:51.423           12664       5500       37930548.25  
2013-09-01 21:29:57.152            9118       5500       31099274.25  
2013-09-02 01:28:59.673            8665       5500       24657426.50  
2013-09-02 02:22:33.934            8034       5500       25342710.50  
2013-09-02 03:07:48.372            9110       5500       25213333.00
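
Sections 2-4 and 2-5 work with DataFrames named tick_bars, volume_bars, and dollar_bars that are indexed by date_time, but the article does not show how they are created. One possible way, sketched here as an assumption, is to read the CSV files written with to_csv=True back into pandas. `load_bars` is a hypothetical helper, and the sketch assumes the CSV has a header row with a date_time column. The in-memory buffer stands in for a file such as './ES_Sample/dollar_var.csv' and reuses two rows from the dollar bar output above.

```python
import io
import pandas as pd

def load_bars(path_or_buffer):
    """Hypothetical helper: read a bar CSV and index it by its date_time column."""
    bars = pd.read_csv(path_or_buffer, index_col='date_time', parse_dates=True)
    return bars.sort_index()

# In-memory stand-in for a saved bar file (assumed layout: header row + date_time column)
sample_csv = io.StringIO(
    "date_time,tick_num,open,high,low,close,volume\n"
    "2013-09-01 21:34:39.298,11207,1640.25,1643.5,1639.0,1640.75,42862\n"
    "2013-09-02 02:56:24.209,26547,1640.75,1646.0,1640.25,1644.5,42585\n"
)
dollar_bars = load_bars(sample_csv)
print(dollar_bars)
```

A DatetimeIndex is what makes the resample call in section 2-4 possible; tick_bars and volume_bars could be loaded the same way from their own files.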

#2-4 Getting statistics

Check the number of bars (rows) per week for the dollar bars, volume bars, and tick bars.

sample.py

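# Count how many bars tick_bars, volume_bars, and dollar_bars contain per week.
# (The three are assumed to be DataFrames of the bars from 2-3, indexed by date_time as a DatetimeIndex.)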
tick_count = tick_bars['close'].resample('W', label='right').count()
volume_count = volume_bars['close'].resample('W', label='right').count()
dollar_count = dollar_bars['close'].resample('W', label='right').count()

# Combine the three Series tick_count, volume_count, dollar_count into one DataFrame.
count_df = pd.concat([tick_count, volume_count, dollar_count], axis=1)
count_df.columns = ['tick', 'volume', 'dollar']
print(count_df)

# Show the bar chart.
count_df.loc[:, ['tick', 'volume', 'dollar']].plot(kind='bar', figsize=[25, 5], color=('darkblue', 'green', 'darkcyan'))
plt.title('Number of bars over time', loc='center', fontsize=20, fontweight="bold", fontname="Times New Roman")

Output

test.txt

            tick  volume  dollar
date_time                       
2013-09-01     2       1       1
2013-09-08   353     273     180
2013-09-15   295     236     158
2013-09-22   341     265     181

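In the graph above, apart from the partial first week, the heights of the volume and dollar columns are roughly constant over the period (2013/9/1 to 2013/9/22). This indicates that the weekly traded volume and dollar value did not change much during this period.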
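
The first row of the table (2013-09-01) contains only one or two bars because the pandas alias 'W' means weeks ending on Sunday, and 2013-09-01 is a Sunday, so the bars formed that evening fall into a bin of their own. The sketch below illustrates this with the close times and prices of the first five tick bars shown in section 2-3; the small DataFrame is a stand-in for the full tick_bars.

```python
import pandas as pd

# Close times and prices of the first five tick bars from section 2-3
tick_bars = pd.DataFrame(
    {'close': [1640.25, 1641.50, 1643.50, 1644.25, 1645.50]},
    index=pd.to_datetime([
        '2013-09-01 18:53:51.423',
        '2013-09-01 21:29:57.152',
        '2013-09-02 01:28:59.673',
        '2013-09-02 02:22:33.934',
        '2013-09-02 03:07:48.372',
    ]),
)

# 'W' bins end on Sunday; label='right' names each bin by its last day
weekly = tick_bars['close'].resample('W', label='right').count()
print(weekly)
```

The two Sunday bars are counted under 2013-09-01 and the three Monday bars under 2013-09-08, which is why the first week should be ignored when comparing weekly counts.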

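#2-5 Getting statistics of the returns

The Jarque-Bera test was run on the bar-to-bar log returns of tick_bars, volume_bars, and dollar_bars. With a p-value of 0.0 in every case, none of the three can be judged to follow a normal distribution. The test statistic is smallest for the dollar bars and largest for the tick bars.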

sample.py

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf

# (1) Compute the bar-to-bar log returns of the close price for tick_bars, volume_bars, dollar_bars.
tick_returns = np.log(tick_bars['close']).diff().dropna()
volume_returns = np.log(volume_bars['close']).diff().dropna()
dollar_returns = np.log(dollar_bars['close']).diff().dropna()

# (2) Run the Jarque-Bera test on tick_returns, volume_returns, dollar_returns.
print('Test Statistics:')
print('Tick:', '\t', stats.jarque_bera(tick_returns))
print('Volume: ', stats.jarque_bera(volume_returns))
print('Dollar: ', stats.jarque_bera(dollar_returns))


# (3) Rename the variables.
tick_diff = tick_returns
volume_diff = volume_returns
dollar_diff = dollar_returns

print("First 5 rows of tick_diff")
print(tick_diff.head())
print("First 5 rows of volume_diff")
print(volume_diff.head())
print("First 5 rows of dollar_diff")
print(dollar_diff.head())

#(4)Standardize the data
tick_standard = (tick_diff - tick_diff.mean()) / tick_diff.std()
volume_standard = (volume_diff - volume_diff.mean()) / volume_diff.std()
dollar_standard = (dollar_diff - dollar_diff.mean()) / dollar_diff.std()

#(5)Plot the Distributions
plt.figure(figsize=(16,12))
sns.kdeplot(tick_standard, label="Tick", color='darkblue')
sns.kdeplot(volume_standard, label="Volume", color='green')
sns.kdeplot(dollar_standard, label="Dollar", linewidth=2, color='darkcyan')
sns.kdeplot(np.random.normal(size=1000000), label="Normal", color='black', linestyle="--")
plt.xticks(range(-5, 6))
plt.legend(loc=8, ncol=5)
plt.title('Exhibit 1 - Partial recovery of Normality through a price sampling process \nsubordinated to a volume, tick, dollar clock',
          loc='center', fontsize=20, fontweight="bold", fontname="Times New Roman")
plt.xlim(-5, 5)
plt.show()

Output

test.txt


Test Statistics (Jarque-Bera test):
Tick: 	 (JB=12151.451482641653, p-value=0.0)
Volume:  (JB=9107.044443544652,  p-value=0.0)
Dollar:  (JB=5931.62812501079,  p-value=0.0)

First 5 rows of tick_diff
date_time
2013-09-01 21:29:57.152    0.000762
2013-09-02 01:28:59.673    0.001218
2013-09-02 02:22:33.934    0.000456
2013-09-02 03:07:48.372    0.000760
2013-09-02 04:07:27.960    0.000456
Name: close, dtype: float64

First 5 rows of volume_diff
date_time
2013-09-02 01:18:21.928    0.000913
2013-09-02 02:50:32.992    0.000760
2013-09-02 04:57:09.236    0.000760
2013-09-02 07:04:32.076    0.000911
2013-09-02 09:28:41.320   -0.000455
Name: close, dtype: float64

First 5 rows of dollar_diff
date_time
2013-09-02 02:56:24.209    0.002283
2013-09-02 06:37:33.128    0.001823
2013-09-02 09:34:46.141   -0.000304
2013-09-02 22:55:20.297    0.000607
2013-09-03 02:48:45.672   -0.000152
Name: close, dtype: float64
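
For reference, the Jarque-Bera statistic reported above is built from the sample skewness S and kurtosis K as JB = n/6 × (S² + (K − 3)²/4), so it grows as the returns become more skewed or more fat-tailed than a normal distribution. The sketch below computes it by hand and compares it with `scipy.stats.jarque_bera`. `jarque_bera_manual` is a hypothetical helper, and the two samples are synthetic random draws, not the bar returns from this article.

```python
import numpy as np
from scipy import stats

def jarque_bera_manual(x):
    """Hypothetical helper: JB = n/6 * (S^2 + (K - 3)^2 / 4) with biased sample moments."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)                 # second central moment
    skew = np.mean(d ** 3) / m2 ** 1.5   # sample skewness S
    kurt = np.mean(d ** 4) / m2 ** 2     # sample kurtosis K (3 for a normal distribution)
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

# Synthetic samples for illustration only
rng = np.random.default_rng(0)
normal_sample = rng.normal(size=5000)
fat_tailed_sample = rng.standard_t(df=3, size=5000)

jb_normal = jarque_bera_manual(normal_sample)
jb_fat = jarque_bera_manual(fat_tailed_sample)
print('normal    :', jb_normal, stats.jarque_bera(normal_sample)[0])
print('fat-tailed:', jb_fat, stats.jarque_bera(fat_tailed_sample)[0])
```

Read this way, the ordering of the statistics in the output above (tick > volume > dollar) suggests that the dollar bar returns are the closest of the three to normality, even though normality is still rejected for all of them.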
