More than 3 years have passed since last update.

MoleculeNetのPDBBind、DeepChemのRDKitGridFeaturizerで遊んでみる

Last updated at 2020-05-29Posted at 2020-05-24

#はじめに
DeepChemにはMolculeNet( http://moleculenet.ai/ ) のデータが収納されており、このデータを利用すると化合物で機械学習を試すことができる。このMoleculeNetの各種データの中でひと際変わった輝きを放っているのが、「PDBBind」というデータである。
なぜ変わっているかというと、他のデータはsdfやsmilesに格納された「化合物データ」が学習データであるのに対し、これはタンパク質とリガンド(タンパク質に特異的に結合する化合物)の結合状態が、学習データとなっているからである。
当然ながら、Featurizerも他のものとは異なり、「RDKitGridFeaturizer」など通常のデータセットに対するものと異なるものが使われている。

今回は手始めにPDBBindデータセットをローカルにダウンロードし、RDKitGridFeaturierでFeaturizeをするところまでやってみた。

#環境

deepchem 2.2.1.dev54
pdbfixer

実行にはDeepChemに加えて、PDBFixerが必要だ。PDBFixerのインストール方法は以下の通り。

$ conda install -c omnia pdbfixer

#事前準備
http://moleculenet.ai/datasets-1 の「PDBBind」のリンクをクリックすると、データセットがダウンロードできるので、適当な場所に解凍しておく。容量はかなり大きい。ちなみに各データの形式は、リガンドがsdf(とmol2)、タンパク質がpdb形式になっている。

#ソース
deepchemのload_pdbdataset.pyというファイルを流用し、不要なところを削ってある。
元のソースをほぼそのまま流用しているので、RdkitGridFeaturizerだけでなく、ComplexNeighborListFragmentAtomicCoordinatesや、AtomicConvFeaturizer 等も指定できるようにしている。
RDKitGridFeaturizerは、feature_typeとして　'ecfp', 'splif', 'hbond', 'salt_bridge', 'pi_stack', 'cation_pi', 'charge' を指定できるようになっている。

パス等は自分の環境に読み替えてほしい。

deepchem_test.py

import os
import time
import numpy as np
from deepchem.feat import rdkit_grid_featurizer as rgf
from deepchem.feat.atomic_coordinates import ComplexNeighborListFragmentAtomicCoordinates
from deepchem.feat.graph_features import AtomicConvFeaturizer

def main():

    split = "random",
    featurizer = "grid"
    subset = "core"
    load_binding_pocket = True
    pdbbind_tasks = ["-logKd/Ki"]
    data_folder = "../data/v2015"

    if subset == "core":
        index_labels_file = os.path.join(data_folder, "INDEX_core_data.2013_small")
    elif subset == "refined":
        index_labels_file = os.path.join(data_folder, "INDEX_refined_data.2015")
    else:
        raise ValueError("Other subsets not supported")

    # Extract locations of data
    with open(index_labels_file, "r") as g:
        pdbs = [line[:4] for line in g.readlines() if line[0] != "#"]
        if load_binding_pocket:
            protein_files = [
            os.path.join(data_folder, pdb, "%s_pocket.pdb" % pdb) for pdb in pdbs
        ]
        else:
            protein_files = [
            os.path.join(data_folder, pdb, "%s_protein.pdb" % pdb) for pdb in pdbs
            ]
        ligand_files = [
        os.path.join(data_folder, pdb, "%s_ligand.sdf" % pdb) for pdb in pdbs
        ]

    # Extract labels
    with open(index_labels_file, "r") as g:
        labels = np.array([
            # Lines have format
            # PDB code, resolution, release year, -logKd/Ki, Kd/Ki, reference, ligand name
            # The base-10 logarithm, -log kd/pk
            float(line.split()[3]) for line in g.readlines() if line[0] != "#"
        ])

    # Featurize Data
    if featurizer == "grid":
        featurizer = rgf.RdkitGridFeaturizer(
        voxel_width=2.0,
         feature_types=[
           'ecfp', 'splif', 'hbond', 'salt_bridge', 'pi_stack', 'cation_pi', 'charge'
        ],
        flatten=False)
    elif featurizer == "atomic" or featurizer == "atomic_conv":
    # Pulled from PDB files. For larger datasets with more PDBs, would use
    # max num atoms instead of exact.

        frag1_num_atoms = 70  # for ligand atoms
        if load_binding_pocket:
            frag2_num_atoms = 1000
            complex_num_atoms = 1070
        else:
            frag2_num_atoms = 24000  # for protein atoms
            complex_num_atoms = 24070  # in total

        max_num_neighbors = 4
        # Cutoff in angstroms
        neighbor_cutoff = 4
        if featurizer == "atomic":
            featurizer = ComplexNeighborListFragmentAtomicCoordinates(
            frag1_num_atoms=frag1_num_atoms,
            frag2_num_atoms=frag2_num_atoms,
            complex_num_atoms=complex_num_atoms,
            max_num_neighbors=max_num_neighbors,
            neighbor_cutoff=neighbor_cutoff)
        if featurizer == "atomic_conv":
            featurizer = AtomicConvFeaturizer(
            labels=labels,
            frag1_num_atoms=frag1_num_atoms,
            frag2_num_atoms=frag2_num_atoms,
            complex_num_atoms=complex_num_atoms,
            neighbor_cutoff=neighbor_cutoff,
            max_num_neighbors=max_num_neighbors,
            batch_size=64)
    else:
        raise ValueError("Featurizer not supported")

    print("\nFeaturizing Complexes for \"%s\" ...\n" % data_folder)
    feat_t1 = time.time()
    features, failures = featurizer.featurize_complexes(ligand_files, protein_files)
    print(features.shape)
    print(features)
    feat_t2 = time.time()
    print("\nFeaturization finished, took %0.3f s." % (feat_t2 - feat_t1))


if __name__ == '__main__':
    main()

#実行
実行すると、features.shape, features にこんな形で表示される。
これは、5つの学習データがあり、1つの学習データが18944次元の特徴を持っていることを示している。

(5, 18944)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

feature_typesの指定を変えるなどして feature_type毎の次元数を調べたところ、以下の通りであった。

'ecfp' 4096
'splif' 12288
'hbond 1536,
'salt_bridge 512
'pi_stack 0
'cation_pi 0
'charge' 512

実際には、1次元にflattenされているが、rgf.RdkitGridFeaturizer のflatten引数をFalseにすると元の次元を見ることができる。
salt_bridgeなどは、512次元が、元々 8 x 8 x 8 になっているため、3次元座標と関係しているのかもしれない。

今回はここまで。

#おわりに
今回は、データを準備して特徴の中身を見るところまで進んだものの、残念ながら自分のデータをつかって「遊ぶ」ところまではいかなかった。

最大の疑問として、タンパク質とリガンドの結合状態は、結合周辺のタンパク質/リガンドの原子や結合によると思われるが、これらは学習データにより種類や数が異なるにもかかわらず、全ての学習データのFeaturize結果が同じ次元になっているのはなぜか。その答えはRDKitGridFeaturizer のソースを見るしかないだろう。

1次元の学習データであれば、DeepLearning 以外でも予測はできるため、MoleculeNet に対抗して、従来の機械学習手法で予測してみても面白いかもしれない。
また、RDKitGridFeaturizerを参考に、自分のアイデアを反映したFeaturizerを作っても面白いかもしれない。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up