LoginSignup
3
2

More than 3 years have passed since last update.

SMILES表データから様々なデータを一発で生成する

Last updated at Posted at 2021-02-08

時間がない人向け

コピペで使えて、SMILESのテーブルデータから色々なデータを一発で生成するpythonクラスを書きました。
MOLオブジェクト、RDKit記述子、原子位置情報(deepchemで言うところのSmilesToImage)に対応しています。
他の記述子とかも同じ要領で書けるはずです。
個人的な備忘録なのでツッコミどころ多くても許してください。

はじめに

ケモインフォマティクスは化合物の情報とコンピュータを用いて様々な課題を解決しようとする学問領域です。
近年の機械学習などの手法の発展により、化合物情報を用いて物性を高度に予測できるようになってきました。
最近ではAlphaFold2の躍進などが有名1

SMILESは有名・簡易な化合物表記方法の一種で、人間とコンピュータが共に理解しやすい文字列で化合物を表現します2
SMILESデータを用いることで比較的簡単に回帰や分類問題を解くことができるので、しばしばケモインフォマティクス入門に取り挙げられます3
今回はSMILESを用いたケモインフォマティクスコンペティションに参加したので、その備忘録として、役立ちそうなスクリプトをアウトプットすることにしました。

実装

  • Python: 3.7.4
  • rdkit: 2019.09.3.0
  • numpy: 1.18.1
  • pandas: 1.0.1
utils.py
from dataclasses import dataclass, field
from typing import ClassVar, List, Dict, Tuple
import sys
import csv
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Descriptors
from rdkit.Chem import MolStandardize
from rdkit.ML.Descriptors import MoleculeDescriptors

@dataclass
class data_loader:    
    # 実験用
    data_file: str # ファイルパス
    task: str # task名
    preprocessing: bool = False # MOLオブジェクト前処理一括変更用
    normalize: bool = True # MOLオブジェクトを標準化するかどうか
    saltremove: bool = True # MOLオブジェクトを脱塩するかどうか
    uncharge: bool = True # MOLオブジェクトを中和するかどうか

    # 固定
    df: pd.DataFrame = None 
    names: List[str] = field(default_factory=list)

    def data_loading(self):
        self.df = pd.DataFrame()
        if self.data_file:
            with open(self.data_file, 'r') as f:
                reader = list(csv.reader(f))[1:] # 1行目がカラムなので
                smiles_list = [row[0] for row in reader]
                y = [float(row[1]) for row in reader]

            self.df[self.task] = y

        else:
            reader = sys.stdin
            smiles_list = [row.split(",")[0] for row in reader if row.split(",")[0] != 'SMILES']
            #print(smiles_list)
            #return smiles_list  
        self.df['SMILES'] = smiles_list

    def _normalize(self, mol):
        normalizer = MolStandardize.normalize.Normalizer()
        mol = normalizer.normalize(mol)
        return mol

    def _saltremove(self, mol):
        lfc = MolStandardize.fragment.LargestFragmentChooser()
        mol = lfc.choose(mol)
        return mol

    def _uncharge(self, mol):
        uc = MolStandardize.charge.Uncharger()
        mol = uc.uncharge(mol)
        return mol

    def _mol2img(self, mol, mode='std', img_size = 96, res = 0.5, max_len = 300):
        embed = int(img_size * res /2)

        cmol = Chem.Mol(mol.ToBinary())
        cmol.ComputeGasteigerCharges()
        AllChem.Compute2DCoords(cmol)
        atom_coords = cmol.GetConformer(0).GetPositions()

        img = np.zeros((img_size, img_size, 1))
        # Compute bond properties

        try:
            if mode == 'std':
                img = np.zeros((img_size, img_size, 1))

                bond_props = np.array([[2.0, bond.GetBeginAtomIdx(),
                                    bond.GetEndAtomIdx()] for bond in mol.GetBonds()])
                # Compute atom properties
                atom_props = np.array([[atom.GetAtomicNum()] for atom in cmol.GetAtoms()])

                bond_props = bond_props.astype(np.float32)
                atom_props = atom_props.astype(np.float32)
            else:
                img = np.zeros((img_size, img_size, 4))

                # Compute bond properties
                bond_props = np.array([[
                  bond.GetBondTypeAsDouble(),
                  bond.GetBeginAtomIdx(),
                  bond.GetEndAtomIdx()
                ] for bond in mol.GetBonds()])
                # Compute atom properties
                atom_props = np.array([[
                  atom.GetAtomicNum(),
                  atom.GetProp("_GasteigerCharge"),
                  atom.GetExplicitValence(),
                  atom.GetHybridization().real,
                ] for atom in cmol.GetAtoms()])

                bond_props = bond_props.astype(np.float32)
                atom_props = atom_props.astype(np.float32)

                partial_charges = atom_props[:, 1]
                #if np.any(np.isnan(partial_charges)):
                #    return np.array([])

            frac = np.linspace(0, 1, int(1 / res * 2))
            # Reshape done for proper broadcast
            frac = frac.reshape(-1, 1, 1)

            bond_begin_idxs = bond_props[:, 1].astype(int)
            bond_end_idxs = bond_props[:, 2].astype(int)

            # Reshapes, and axes manipulations to facilitate vector processing.
            begin_coords = atom_coords[bond_begin_idxs]
            begin_coords = np.expand_dims(begin_coords.T, axis=0)
            end_coords = atom_coords[bond_end_idxs]
            end_coords = np.expand_dims(end_coords.T, axis=0)

            # Draw a line between the two atoms.
            # The coordinates of this line, are indicated in line_coords
            line_coords = frac * begin_coords + (1 - frac) * end_coords
            # Turn the line coordinates into image positions
            bond_line_idxs = np.ceil(
                (line_coords[:, 0] + embed) / res).astype(int)
            bond_line_idys = np.ceil(
                (line_coords[:, 1] + embed) / res).astype(int)
            # Set the bond line coordinates to the bond property used.
            img[bond_line_idxs, bond_line_idys, 0] = bond_props[:, 0]

            # Turn atomic coordinates into image positions
            atom_idxs = np.round(
                (atom_coords[:, 0] + embed) / res).astype(int)
            atom_idys = np.round(
                (atom_coords[:, 1] + embed) / res).astype(int)
            # Set the atom positions in image to different atomic properties in channels
            img[atom_idxs, atom_idys, :] = atom_props
        except :
            sys.stderr.write(Chem.MolToSmiles(mol)+'\n')

        return img

    @property
    def RDKitDescriptor(self):
        if not 'SMILES' in self.df.columns:
            self.data_loading()

        self.names = [x[0] for x in Descriptors._descList]
        calc = MoleculeDescriptors.MolecularDescriptorCalculator(self.names)

        descriptors = [np.array(list(np.nan for i in range(len(self.names)))) if mol is None else calc.CalcDescriptors(mol) for mol in tqdm(self.mol_list)]
        IPC = [np.nan if mol is None else Descriptors.Ipc(mol, avg=True) for mol in tqdm(self.mol_list)]

        tmp = pd.DataFrame(descriptors, columns=self.names)
        tmp['Ipc'] = pd.Series(IPC,name='Ipc')

        self.df[self.names] = tmp

        return tmp.values

    @property
    def ChemImage(self):
        #https://github.com/deepchem/deepchem/blob/master/deepchem/feat/molecule_featurizers/smiles_to_image.py#L11-L165
        if not 'MOL' in self.df.columns:
            self.mol_list

        max_len = max(len(smiles) for smiles in self.smiles_list)
        self.df['ChemImage'] = self.df['MOL'].apply(lambda mol: self._mol2img(mol=mol, max_len=max_len))

        return np.array(list(self.df['ChemImage']))


    @property
    def smiles_list(self) -> pd.DataFrame:
        if not 'SMILES' in self.df.columns:
            self.data_loading()
        return self.df['SMILES'].to_list()

    @property
    def y(self) -> pd.DataFrame:
        if not 'SMILES' in self.df.columns:
            self.data_loading()
        return np.array(self.df[self.task].to_list())

    @property
    def mol_list(self):
        if not 'SMILES' in self.df.columns:
            self.data_loading()
        elif not 'MOL' in self.df.columns:
            self.df['MOL'] = self.df['SMILES'].apply(Chem.MolFromSmiles)
            mol_list = self.df['MOL'].to_list()
            if self.preprocessing:
                if self.normalize:
                    mol_list = [None if mol is None else self._normalize(mol) for mol in mol_list]
                if self.saltremove:
                    mol_list = [None if mol is None else self._saltremove(mol) for mol in mol_list]
                if self.uncharge:
                    mol_list = [None if mol is None else self._uncharge(mol) for mol in mol_list]
                if not (self.normalize or self.saltremove or self.uncharge):
                    pass
            self.df['MOL'] = mol_list
        else:
            pass 
        return self.df['MOL'].to_list()

    """
    @property
    def feature_list(self, feature, function):
        if not 'SMILES' in self.df.columns:
            self.data_loading()
        elif not feature in self.df.columns:
            self.df[feature] = self.df['SMILES'].apply(function)
        else:
            pass
        return self.df['feature'].to_list()
    """

    @property
    def get_df(self):
        if not 'SMILES' in self.df.columns:
            self.data_loading()

        if 'MOL' not in self.df.columns:
            self.mol_list

        if not self.names:
            self.RDKitDescriptor

        if 'ChemImage' not in self.df.columns:
            self.ChemImage

        return self.df

使い方

サンプルデータのダウンロード

Ames試験のオープンソースデータを保存して一部抜粋しました。

#https://doc.ml.tu-berlin.de/toxbenchmark/
dataset = pd.read_csv('Mutagenicity_N6512.csv')
data = pd.DataFrame()
data['SMILES'] = dataset['Canonical_Smiles'] \
    .str.replace('\\','') \
    .str.replace('/','') \
    .iloc[:100]
data['y'] = dataset['Activity']
data.to_csv('data.csv', index=False)
display(data.head())
SMILES y
0 O=C1c2ccccc2C(=O)c3c1ccc4c3[nH]c5c6C(=O)c7cccc... 0
1 NNC(=O)CNC(=O)C=N#N 1
2 O=C1NC(=O)C(=N#N)C=N1 1
3 NC(=O)CNC(=O)C=N#N 1
4 CCCCN(CC(O)C1=CC(=N#N)C(=O)C=C1)N=O 1

データ生成

data_file: 保存したcsvデータのファイル名
task: yに当たるカラム名
SMILES: SMILES。そのまま。
MOL: SMILESから生成したMOLオブジェクト
MaxEStateIndex~fr_urea: RDKit記述子
ChemImage: 原子位置情報


data_file = 'data.csv'
task = 'Activity'

dataset = data_loader(data_file, task=task)
dataset.data_loading()
df = dataset.get_df
display(df.head())

表データ

Activity SMILES MOL MaxEStateIndex MinEStateIndex MaxAbsEStateIndex ... MinAbsEStateIndex qed MolWt fr_unbrch_alkane fr_urea ChemImage
0 0 O=C1c2ccccc2C(=O)c3c1ccc4c3[nH]c5c6C(=O)c7ccccc7C(=O)c6c8[nH]c9c%10C(=O)c%11ccccc%11C(=O)c%10ccc9c8c45 14.4310727 -0.3785752 14.4310727 ... 0.12898832 0.18458152 646.614 0 0 [[[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] ... [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]]]
1 1 NNC(=O)CNC(=O)C=N#N 10.4221967 -0.6702778 10.4221967 ... 0.24865741 0.10384883 157.133 0 0 [[[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] ... [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]]]
2 1 O=C1NC(=O)C(=N#N)C=N1 10.5217593 -0.7524074 10.5217593 ... 0.27814815 0.34214759 138.086 0 1 [[[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] ... [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]]]
3 1 NC(=O)CNC(=O)C=N#N 10.3193056 -0.665463 10.3193056 ... 0.25907407 0.26686523 142.118 0 0 [[[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] ... [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]]]
4 1 CCCCN(CC(O)C1=CC(=N#N)C(=O)C=C1)N=O 11.230461 -0.9762443 11.230461 ... 0.034298 0.24183825 264.285 0 0 [[[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] ... [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]] [[0.] [0.] [0.] ... [0.] [0.] [0.]]]

画像データ

Chem.Draw.MolToImage(df['MOL'][0])

image.png

原子位置情報

plt.imshow(df['ChemImage'][0])

image.png

解説

data_loaderをインスタンス化し、各種メソッドでSMILESからデータを生成します。
get_dfメソッドで一発で生成できますが、個別にも取ってこられます。

RDKit記述子の生成はPandasToolsを使えばもっと高速化すると思いますが、ちまちま生成する方が中間処理(エラー処理や化合物の前処理など)で色々できるので、個人的には好きです。

原子位置情報を訓練データとしてCNNなどのモデルに学習させることが可能みたいです4
ただし、数千規模のデータだと無理そうでした。

分子フィンガープリントMordred記述子など、他のデータも同様に生成できると思います。

deepchemでは様々な記述子の生成が楽に実装できるので、組み合わせると色々できそう。
ただし、エラーが頻発する印象が強い。

以上、お読みいただきありがとうございました。

3
2
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
3
2