Introduction

MoleculeNet is a benchmark for machine learning on chemical compounds. Its datasets and benchmark programs ship with DeepChem.
This article explains how to run them.

Environment

CentOS7
DeepChem 2.1

Procedure

This assumes that DeepChem is already installed.

Use the wget command to fetch the following program from the examples directory of the DeepChem repository on GitHub.

$ wget https://raw.githubusercontent.com/deepchem/deepchem/master/examples/benchmark.py

The downloaded benchmark.py can run benchmarks with various splitting methods, datasets, and models.
Its arguments are as follows.

-s  splitting method
-d  dataset
-m  model
-t  whether to also evaluate on the test data
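
As a rough illustration of how these four flags could be handled, here is a minimal argparse sketch. It is not the actual source of benchmark.py: the function name build_parser, the destination names, and the choice to make -s, -d, and -m repeatable are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser for the four flags listed above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="MoleculeNet benchmark runner (sketch)")
    # Assumption: each of -s, -d, -m may be given more than once.
    parser.add_argument("-s", action="append", dest="splitters", default=[],
                        help="splitting method, e.g. random")
    parser.add_argument("-d", action="append", dest="datasets", default=[],
                        help="dataset name, e.g. lipo")
    parser.add_argument("-m", action="append", dest="models", default=[],
                        help="model name, e.g. xgb_regression")
    # -t is a switch: when present, the test set is evaluated as well.
    parser.add_argument("-t", action="store_true", dest="test",
                        help="also evaluate on the test set")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is
# reproducible when pasted into an interpreter.
args = build_parser().parse_args(
    ["-s", "random", "-d", "lipo", "-m", "xgb_regression"])
print(args.splitters, args.datasets, args.models, args.test)
```

Because -t is a plain switch, omitting it leaves the test flag false, which matches the description of -t as optional test-set evaluation.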

The following example runs a benchmark using XGBoost regression on the Lipophilicity dataset with a random split.

(deepchem) [kimisyo@localhost qsar]$ python benchmark.py -s random -d lipo -m xgb_regression
/home/kimisyo/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
-------------------------------------
Benchmark on dataset: lipo
-------------------------------------
Splitting function: random
Loading raw samples now.
shard_size: 8192
About to start loading CSV from /tmp/Lipophilicity.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
TIMING: featurizing shard 0 took 21.794 s
TIMING: dataset construction took 21.915 s
Loading dataset from disk.
TIMING: dataset construction took 0.162 s
Loading dataset from disk.
TIMING: dataset construction took 0.153 s
Loading dataset from disk.
TIMING: dataset construction took 0.074 s
Loading dataset from disk.
TIMING: dataset construction took 0.073 s
Loading dataset from disk.
/home/kimisyo/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
About to initialize singletask to multitask model
Initializing directory for task exp
-----------------------------
Start fitting: xgb_regression
About to create task-specific datasets
Splitting multitask dataset into singletask datasets
TIMING: dataset construction took 0.002 s
Loading dataset from disk.
Processing shard 0
        Task exp
Dataset for task exp has shape ((3360, 1024), (3360, 1), (3360, 1), (3360,))
Fitting model for task exp
computed_metrics: [0.9540359039831847]
computed_metrics: [0.5766624966302507]
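
The scores appear in the log as computed_metrics lines. When comparing several runs it can help to extract them programmatically. The following is a small sketch that assumes the log format shown above; the function name parse_computed_metrics is made up for this example and is not part of DeepChem.

```python
import re

# Matches lines such as "computed_metrics: [0.9540359039831847]".
METRIC_LINE = re.compile(r"^computed_metrics:\s*\[(.*)\]\s*$")


def parse_computed_metrics(log_text):
    """Return one list of floats per computed_metrics line, in log order."""
    results = []
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            # The bracketed part may hold several comma-separated values.
            values = [float(v) for v in match.group(1).split(",") if v.strip()]
            results.append(values)
    return results


# The last lines of the log above, used as sample input.
sample_log = """Fitting model for task exp
computed_metrics: [0.9540359039831847]
computed_metrics: [0.5766624966302507]
"""
scores = parse_computed_metrics(sample_log)
print(scores)
```

The log could be captured by redirecting the output of benchmark.py to a file and passing the file contents to this function.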

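When the benchmark is run through this program as is, the metric is R^2, which makes it hard to compare with the results in the paper (RMSE).
The metric can be changed to mean_squared_error by passing it as the metric argument when calling DeepChem's benchmark function, as shown below.
Taking the square root of the reported value then gives the RMSE.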

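# Assumes "import numpy as np" and "import deepchem as dc", as the calls below imply.
for dataset in datasets:
  for split in splitters:
    for model in models:
      np.random.seed(seed)
      # Original call (default metric):
      #   dc.molnet.run_benchmark(
      #       [dataset], str(model), split=split, test=test, seed=seed)
      # Modified call: report mean squared error instead.
      dc.molnet.run_benchmark(
          [dataset], str(model), split=split, test=test, seed=seed,
          metric=[dc.metrics.Metric(dc.metrics.mean_squared_error, np.mean)])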
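
One caveat is worth illustrating. Assuming the second argument of Metric acts as the averager over tasks, the reported value is the MSE averaged across tasks. Lipophilicity has a single task (the log shows only task exp), so the square root of the reported value is the RMSE. For a dataset with several tasks, the square root of the averaged MSE is generally not the same as the average of the per-task RMSE values. A small NumPy sketch with made-up per-task MSE values (not benchmark results):

```python
import numpy as np

# Made-up per-task MSE values, for illustration only.
single_task_mse = np.array([0.38])
multi_task_mse = np.array([0.25, 1.0])


def sqrt_of_mean(mse_per_task):
    """Square root of the task-averaged MSE (what sqrt(reported value) gives)."""
    return float(np.sqrt(np.mean(mse_per_task)))


def mean_of_sqrt(mse_per_task):
    """Average of the per-task RMSE values."""
    return float(np.mean(np.sqrt(mse_per_task)))


# With one task the two quantities coincide.
print(sqrt_of_mean(single_task_mse), mean_of_sqrt(single_task_mse))
# With several tasks they generally differ.
print(sqrt_of_mean(multi_task_mse), mean_of_sqrt(multi_task_mse))
```

For single-task regression datasets the simple square root is therefore sufficient; for multi-task datasets it is worth checking which of the two quantities the reference result uses.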

Running graph convolution regression (graphconvreg) on the Lipophilicity dataset gives the following result.

$ python benchmark.py -s random -d lipo -m graphconvreg -t
/home/kimisyo/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
-------------------------------------
Benchmark on dataset: lipo
-------------------------------------
Splitting function: random
Loading dataset from disk.
Loading dataset from disk.
Loading dataset from disk.
/home/kimisyo/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
-----------------------------
Start fitting: graphconvreg
2018-11-19 02:09:25.805422: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
/home/kimisyo/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
computed_metrics: [0.04206545311187497]
computed_metrics: [0.5371709156355368]
computed_metrics: [0.38197098662567236]

Taking the square root of the last value, rounded to 0.38:

math.sqrt(0.38) = 0.6164414002968976

This is close to the value reported in the paper.
