
Trying to Ship Machine Learning Models Inside a Python Package

Python

Posted at 2020-06-10 (last updated at 2020-06-10)

Overview

The folks at Mercari have published some great insights on taking machine learning from development to production:
https://mercari.github.io/ml-system-design-pattern/README_ja.html
Here I expand a little on the model management part.

First, the whole picture:
https://github.com/arc279/model-in-package-sample
Honestly, that repository is pretty much the entire story.

The rest of this post explains the key points.
I won't explain setuptools itself, so look it up as needed.

The samples were run in the following environment:

(.venv) $ python -V
Python 3.8.1

A Python package can include data other than source code

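Until a while ago, the setuptools world was a confusing mix of package_data, data_files, and so on,
but lately things seem to have converged on MANIFEST.in and importlib.resources.

Note that importlib.resources was added in Python 3.7; on older versions you need something like pkg_resources instead.
Frankly, pkg_resources is awkward to use, so prefer importlib.resources on 3.7 or later whenever you can.

For usage, see these: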

https://github.com/arc279/model-in-package-sample/blob/master/MANIFEST.in
https://github.com/arc279/model-in-package-sample/blob/master/src/mymodel/__init__.py#L5
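
As a minimal sketch of the idea (not the sample repository's actual code), the following builds a throwaway package in a temporary directory, places a binary file next to its `__init__.py`, and reads it back through importlib.resources. The package name `demo_pkg`, the file name `data.bin`, and both helper functions are hypothetical. It assumes Python 3.9+ for `importlib.resources.files`; on 3.7 and 3.8, `importlib.resources.read_binary(package, name)` plays the same role.

```python
import importlib
import importlib.resources
import sys
import tempfile
from pathlib import Path


def make_demo_package(root: Path, payload: bytes) -> str:
    """Create a tiny package with a data file; returns the (hypothetical) package name."""
    pkg_dir = root / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "data.bin").write_bytes(payload)
    return "demo_pkg"


def read_package_data(package_name: str, resource: str) -> bytes:
    """Read a file shipped inside a package without building paths from __file__."""
    package = importlib.import_module(package_name)
    return importlib.resources.files(package).joinpath(resource).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    name = make_demo_package(Path(tmp), b"\x00\x01\x02")
    sys.path.insert(0, tmp)  # make the throwaway package importable
    try:
        importlib.invalidate_caches()
        data = read_package_data(name, "data.bin")
    finally:
        sys.path.remove(tmp)

print(len(data))
```

In a real project the data file only ends up in the built distribution if the packaging configuration includes it, which is what the MANIFEST.in linked above is for.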

Building the wheel looks like this

Note the *.pkl files in the listing.

$ python setup.py bdist_wheel

(..snip..)

$ zipinfo -1 dist/mymodel-1.1.1_titanic.from_kaggle-py3-none-any.whl
mymodel/__init__.py
mymodel/version.py
mymodel/titanic_sample/__init__.py
mymodel/titanic_sample/models/__init__.py
mymodel/titanic_sample/models/LogisticRegression/__init__.py
mymodel/titanic_sample/models/LogisticRegression/model.pkl
mymodel/titanic_sample/models/RandomForestClassifier/__init__.py
mymodel/titanic_sample/models/RandomForestClassifier/model.pkl
mymodel/titanic_sample/models/SVC/__init__.py
mymodel/titanic_sample/models/SVC/model.pkl
mymodel/titanic_sample/models/SVC/__pycache__/__init__.cpython-38.pyc
mymodel/titanic_sample/models/__pycache__/__init__.cpython-38.pyc
mymodel-1.1.1_titanic.from_kaggle.dist-info/METADATA
mymodel-1.1.1_titanic.from_kaggle.dist-info/WHEEL
mymodel-1.1.1_titanic.from_kaggle.dist-info/top_level.txt
mymodel-1.1.1_titanic.from_kaggle.dist-info/RECORD
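
A wheel is a zip archive, so the same check can be done from Python with the standard zipfile module. The sketch below is only an illustration: `list_data_files` is a hypothetical helper, and the in-memory archive with its `0.0.0` version stands in for a real wheel file.

```python
import io
import zipfile


def list_data_files(wheel_file) -> list:
    """Return archive members that are not Python sources, bytecode, or dist-info metadata."""
    with zipfile.ZipFile(wheel_file) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if not name.endswith((".py", ".pyc"))
            and ".dist-info/" not in name
        )


# Build a small in-memory archive standing in for a real wheel (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("mymodel/__init__.py", "")
    zf.writestr("mymodel/titanic_sample/models/SVC/__init__.py", "")
    zf.writestr("mymodel/titanic_sample/models/SVC/model.pkl", b"not a real pickle")
    zf.writestr("mymodel-0.0.0.dist-info/METADATA", "Name: mymodel\n")
buffer.seek(0)

print(list_data_files(buffer))
```

Passing a path such as the file under dist/ instead of the buffer should work the same way, since `zipfile.ZipFile` accepts both paths and file-like objects.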

On the consumer side

Once it is built as a wheel, you can install it with pip.

(.venv) $ pip install dist/mymodel-1.1.1_titanic.from_kaggle-py3-none-any.whl

(..snip..)

(.venv) $ pip list
Package         Version
--------------- -------------------------
joblib          0.15.1
mymodel         1.1.1-titanic.from-kaggle
numpy           1.18.5
pandas          1.0.4
pip             19.2.3
python-dateutil 2.8.1
pytz            2020.1
scikit-learn    0.23.1
scipy           1.4.1
setuptools      41.2.0
six             1.15.0
threadpoolctl   2.1.0
wheel           0.34.2

Importing it

(.venv) $ ipython
In [1]: import mymodel

In [2]: mymodel.__version__
Out[2]: '1.1.1-titanic.from-kaggle'

Reading data from inside the package

Continuing the same ipython session.

In [3]: import importlib.resources

In [4]: import pickle

In [5]: import mymodel.titanic_sample.models.LogisticRegression

In [6]: b = importlib.resources.read_binary(mymodel.titanic_sample.models.LogisticRegression, "model.pkl")

In [9]: len(b)
Out[9]: 739

In [10]: c = pickle.loads(b)

In [11]: c.__class__
Out[11]: sklearn.linear_model._logistic.LogisticRegression

That works.
For details, see the sample repository.
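
The session above can be wrapped in a small helper that picks a model by name. This is a sketch under assumptions, not code from the sample: `load_model`, the `demo_models` package, and the `Dummy` model are hypothetical, a plain dict stands in for a trained scikit-learn estimator, and `importlib.resources.files` requires Python 3.9+ (on 3.8, `read_binary` as used in the session does the same job).

```python
import importlib
import importlib.resources
import pickle
import sys
import tempfile
from pathlib import Path


def load_model(models_package: str, model_name: str, filename: str = "model.pkl"):
    """Unpickle `filename` from the subpackage `models_package.model_name`.

    Only unpickle files you trust: pickle can execute arbitrary code on load.
    """
    package = importlib.import_module(f"{models_package}.{model_name}")
    resource = importlib.resources.files(package).joinpath(filename)
    with resource.open("rb") as fp:
        return pickle.load(fp)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout mirroring the wheel: demo_models/<ModelName>/model.pkl
    model_dir = Path(tmp) / "demo_models" / "Dummy"
    model_dir.mkdir(parents=True)
    (model_dir.parent / "__init__.py").write_text("")
    (model_dir / "__init__.py").write_text("")
    # A plain dict stands in for a trained estimator.
    (model_dir / "model.pkl").write_bytes(pickle.dumps({"coef": [0.5, -1.25]}))

    sys.path.insert(0, tmp)  # make the throwaway package importable
    try:
        importlib.invalidate_caches()
        model = load_model("demo_models", "Dummy")
    finally:
        sys.path.remove(tmp)

print(model)
```

With the installed wheel, the equivalent call would presumably be `load_model("mymodel.titanic_sample.models", "LogisticRegression")`, with scikit-learn installed so that the estimator class can be unpickled.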

By the way

The takeaway from the points above is that a data-only Python package is also possible.
How much to include will depend on the project, so there is room to experiment.

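Finally, about versioning

Python's package version rules were fairly loose with the tooling used here (setuptools 41.2.0 in the pip list above), so semantic versioning with a descriptive suffix can be adopted.
That means this example from Mercari can be used as is:
https://mercari.github.io/ml-system-design-pattern/Operation-patterns/Data-model-versioning-pattern/design_ja.html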

Like this:
https://github.com/arc279/model-in-package-sample/blob/master/setup.cfg#L3
https://github.com/arc279/model-in-package-sample/blob/master/src/mymodel/version.py
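
If the consuming code needs to branch on the version, a string like the `__version__` shown earlier can be split into its numeric part and its label. The following is a rough sketch: `ModelVersion` and `parse_model_version` are hypothetical names, and the accepted format (three dot-separated numbers, then an optional suffix after `-`, `_`, or `+`) is an assumption based on that one example.

```python
import re
from typing import NamedTuple, Optional


class ModelVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    label: Optional[str]  # free-form suffix, e.g. a dataset or experiment name


# Assumed format: MAJOR.MINOR.PATCH, optionally followed by -, _ or + and a label.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:[-_+](.+))?$")


def parse_model_version(version: str) -> ModelVersion:
    """Split 'MAJOR.MINOR.PATCH-label' into numeric parts and the label."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, label = match.groups()
    return ModelVersion(int(major), int(minor), int(patch), label)


v = parse_model_version("1.1.1-titanic.from-kaggle")
print(v.major, v.minor, v.patch, v.label)
```

Keeping the label separate from the numbers lets the caller compare the numeric part while treating the suffix as opaque metadata.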

That's all there is to it, really.
See the sample on GitHub for the whole picture.
