
Classifying iris with distributed training using dask and xgboost

Posted at 2018-03-29, last updated at 2018-03-29

Following the tutorial below, I'll try running the same training in a distributed fashion with dask.
https://www.kdnuggets.com/2017/03/simple-xgboost-tutorial-iris-dataset.html

The complete code is in this gist.
https://gist.github.com/arc279/fc5fce1eccbfbb6e464683c3e9409c4b

pip install

Setup.

pip install --upgrade pandas sklearn dask xgboost dask_xgboost

If dask is too old, the import fails with an error like

AttributeError: module 'pandas.core.computation' has no attribute 'expressions'

so install with --upgrade.

Output of pip freeze

click==6.7
cloudpickle==0.5.2
dask==0.17.2
dask-xgboost==0.1.5
distributed==1.21.4
HeapDict==1.0.0
msgpack-python==0.5.6
numpy==1.14.2
pandas==0.22.0
psutil==5.4.3
python-dateutil==2.7.2
pytz==2018.3
scikit-learn==0.19.1
scipy==1.0.1
six==1.11.0
sklearn==0.0
sortedcontainers==1.5.9
tblib==1.3.2
toolz==0.9.0
tornado==5.0.1
xgboost==0.7.post4
zict==0.1.3

Preparing the dataset

Split the iris dataset bundled with sklearn using train_test_split.

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

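Training

Set things up so dask can distribute the work.
I ran everything locally, but since the communication goes over TCP it should presumably work across a network as well.

The mechanism behind this is apparently Rabit (Reliable Allreduce and Broadcast Interface).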
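
As a rough illustration of what the "Allreduce" in Rabit's name refers to, the toy simulation below sums per-worker partial results and hands the same total back to every worker. This is a plain-Python sketch, not Rabit's API: the `allreduce_sum` function and the numbers are made up for illustration.

```python
from typing import List


def allreduce_sum(local_values: List[List[float]]) -> List[List[float]]:
    """Toy allreduce: element-wise sum across workers, copied back to every worker."""
    total = [sum(column) for column in zip(*local_values)]
    return [list(total) for _ in local_values]


# Hypothetical partial statistics computed by two workers on their own chunks.
worker_stats = [
    [1.0, 2.0, 3.0],  # worker 0
    [0.5, 0.5, 1.0],  # worker 1
]

reduced = allreduce_sum(worker_stats)
print(reduced)  # every worker now holds [1.5, 2.5, 4.0]
```

The point is only that each worker ends up with an identical aggregate without shipping its raw rows anywhere; how Rabit actually routes the messages is not shown here.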

Installing dask with pip should also install its command-line tools, which are used in the steps below.

Starting the scheduler

Open another terminal and run:

dask-scheduler

By default it starts on port 8786.
Its role is something like a negotiator or ZooKeeper: it coordinates the workers.

Starting the workers

Open several more terminals and run this in each:

dask-worker 127.0.0.1:8786

Creating a client and connecting to the scheduler

The example below was run with two workers.

from dask.distributed import Client
client = Client('127.0.0.1:8786')
print(client)

output

<Client: scheduler='tcp://10.0.1.57:8786' processes=2 cores=8>
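
For a purely local experiment, the separate terminals could probably be replaced by an in-process cluster. The sketch below assumes `LocalCluster` from `dask.distributed` with two single-threaded workers running as threads (`processes=False`); this setup is my assumption and is not what the steps above used. I have not checked that dxgb.train behaves identically on such a cluster.

```python
from dask.distributed import Client, LocalCluster

# Two single-threaded workers inside the current process (no extra terminals).
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# Count the workers the scheduler currently knows about.
n_workers = len(client.scheduler_info()["workers"])
print(n_workers)

# Submit a trivial task to confirm the workers actually execute work.
result = client.submit(sum, [1, 2, 3]).result()
print(result)

# Shut everything down again.
client.close()
cluster.close()
```

The resulting `client` plays the same role as the one created with Client('127.0.0.1:8786').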

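Converting to dask DataFrames for distribution

dask splits the data into chunks, so it can process datasets that do not fit in memory.
chunksize specifies how many rows go into each chunk, not how many chunks to create.

For now I just picked 5 arbitrarily.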
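
To make the effect of chunksize concrete, here is a minimal sketch of row-based chunking. `chunk_bounds` is a hypothetical helper, not part of dask, and the boundaries dask actually picks may differ in detail. The 120 rows correspond to the 150 iris samples minus the 20% held out for testing.

```python
def chunk_bounds(n_rows: int, chunksize: int):
    """Return (start, stop) row ranges, each holding at most `chunksize` rows."""
    return [(start, min(start + chunksize, n_rows))
            for start in range(0, n_rows, chunksize)]


n_train = 120  # 150 iris samples minus the 20% test split
bounds = chunk_bounds(n_train, chunksize=5)

print(len(bounds))  # 24 chunks of 5 rows each
print(bounds[:2])   # [(0, 5), (5, 10)]
```

So chunksize=5 on this training set yields many tiny chunks, which is fine for a demo but would normally be far too small for real data.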

import dask.dataframe as dd
X_train_dd = dd.from_array(X_train, columns=iris.feature_names, chunksize=5)
y_train_dd = dd.from_array(y_train, chunksize=5)

params = {
    'max_depth': 3,
    'eta': 0.3,
    'objective': 'multi:softprob',
    'num_class': 3}

import dask_xgboost as dxgb
bst = dxgb.train(client, params, X_train_dd, y_train_dd)
print(bst)

dxgb.train is where the distributed training happens.

output

<xgboost.core.Booster object at 0x113dfad30>

train returns a plain xgboost Booster, so save the model.

bst.save_model('model.xgb')

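Prediction

Prediction can apparently be distributed too, but this time I do it the ordinary way.
An example of distributed prediction is in the Example on GitHub.

Two things are needed:

Set the saved model on the Classifier
Attach labels with a LabelEncoder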

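import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Load the Booster saved after training.
booster = xgb.Booster()
booster.load_model('./model.xgb')

# Attach it to a fresh classifier. _Booster and _le are private attributes,
# so this may break with other xgboost versions.
clf = xgb.XGBClassifier()
clf._Booster = booster

# predict() maps class indices to labels through this encoder.
clf._le = LabelEncoder().fit(iris.target_names)

# A plain pandas DataFrame (not dask) with the same column names as in training.
X_test_df = pd.DataFrame(data=X_test, columns=iris.feature_names)
print(clf.predict(X_test_df))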
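
As far as I can tell, what happens with 'multi:softprob' is roughly this: the model yields one probability per class, the most probable class index is picked, and the encoder turns that index into a name. The sketch below walks through those steps in plain Python. The raw scores and targets are made up, and `softmax` and `decode` are illustrative helpers, not xgboost functions.

```python
import math

target_names = ["setosa", "versicolor", "virginica"]  # same order as iris.target_names


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def decode(prob_rows, names):
    """Pick the most probable class per row and map its index to a name."""
    classes = sorted(names)  # mirrors LabelEncoder, which sorts its classes
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]


# Made-up raw scores for three samples, purely for illustration.
raw_scores = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3], [-0.5, 0.2, 2.2]]
probs = [softmax(row) for row in raw_scores]
predicted = decode(probs, target_names)
print(predicted)  # ['setosa', 'versicolor', 'virginica']

# y_test holds integer targets, so convert them to names before comparing.
y_true = [0, 1, 1]  # hypothetical targets
accuracy = sum(p == target_names[t] for p, t in zip(predicted, y_true)) / len(y_true)
print(accuracy)
```

Because the predictions come back as names while y_test contains integers, the conversion through target_names is needed before computing any score.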

That's all.

I don't understand the theory or the finer details, so please google those yourselves (runs away)

References

https://github.com/dmlc/xgboost
https://github.com/dask/dask-xgboost
https://github.com/dmlc/xgboost/issues/706
http://xgboost.readthedocs.io/en/latest/python/python_api.html

On early stopping

What is dask?

https://github.com/dask/dask
http://sinhrks.hatenablog.com/entry/2015/09/24/222735
