More than 3 years have passed since last update.

あなたのおっしゃる「タニモト係数」って、どれのことかしら？

Last updated at 2020-12-16Posted at 2020-05-27

化合物の類似性を語るにあたって、有名な指標に「タニモト係数」があります。ところが、「タニモト係数」とだけしか言わない人がけっこういて、「それだけでは説明不十分ですよね？」って悶々とする機会がけっこうあるわけです。

たとえば次の化合物セットを使いましょう

化合物数は１０個です。

smiles = [
    'C1=CC=C2C3CC(CNC3)CN2C1=O',
    'CN1c2c(C(N(C)C1=O)=O)[nH0](CC(CO)O)c[nH0]2',
    'CN1C2CC(CC1C1C2O1)OC(C(c1ccccc1)CO)=O',
    'CN1C2CC(CC1C1C2O1)OC(C(c1cccnc1)CO)=O', # 一つ上の化合物と類似
    'CN(C=1C(=O)N(c2ccccc2)N(C1C)C)C',
    'CN(C=1C(=O)N(C2CCCCC2)N(C1C)C)C', # 一つ上の化合物と類似
    'OCC1C(C(C(C(OCC2C(C(C(C(OC(c3ccccc3)C#N)O2)O)O)O)O1)O)O)O',
    'OCc1ccccc1OC1C(C(C(C(CO)O1)O)O)O',
    'OCc1cc(N)ccc1OC1C(C(C(C(CO)O1)O)O)O', # 一つ上の化合物と類似
    '[nH0]1c(OC)c2c([nH0]cc[nH0]2)[nH0]c1',
]

SMILES表記からRDKitで化合物を生成

from rdkit import Chem

mols = [Chem.MolFromSmiles(smile) for smile in smiles]

Morgan fingerprint (の一種)を生成

from rdkit.Chem import AllChem

fps = [AllChem.GetMorganFingerprint(mol, 3, useFeatures=True) for mol in mols]

タニモト係数を計算

from rdkit import DataStructs

sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]

タニモト係数の分布

このようにすれば、タニモト係数の分布が見られます。タニモト係数 = 1 のものは同一分子だと思われますが、それ以外でタニモト係数が高めの分子対がありますね。それが何なのか、調べてみてください（あるいは想像してみてください）

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

で、Morgan fingerprint にも色々ありまして

Morgan fingerprintもパラメータがいろいろあって、それを変えるとタニモト係数の数値も変わるわけです。

fps = [AllChem.GetMorganFingerprint(mol, 2, useFeatures=True) for mol in mols]
sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]
plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

fps = [AllChem.GetMorganFingerprint(mol, 1, useFeatures=True) for mol in mols]
sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]
plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3, 1024) for mol in mols]
sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]
plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3, 2048) for mol in mols]
sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]
plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

で、フィンガープリントは他にもありまして

全部計算するのメンドくさいので２つだけ。重要なことの一つとして、下記の計算結果ではタニモト係数が1.0のものが１０件以上出てきます。つまり、同一分子でなくてもタニモト係数が1.0になる場合があることは知っておきましょう。

fps = [AllChem.GetMACCSKeysFingerprint(mol) for mol in mols]
sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]
plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

fps = [Chem.RDKFingerprint(mol) for mol in mols]
sim_matrix = [DataStructs.BulkTanimotoSimilarity(fp, fps) for fp in fps]
plt.hist(np.array(sim_matrix).flatten(), bins=20)
plt.grid()
plt.show()

MCSを用いたタニモト係数も定義できるわけでして

from rdkit.Chem import rdFMCS

matrix = []
for mol1 in mols:
    for mol2 in mols:
        mcs = rdFMCS.FindMCS([mol1, mol2])
        a1 = len(mol1.GetAtoms())
        a2 = len(mol2.GetAtoms())
        matrix.append(mcs.numAtoms / (a1 + a2 - mcs.numAtoms) )

plt.hist(np.array(matrix).flatten(), bins=20)
plt.grid()
plt.show()

MCSにもいろんなオプションがあるわけでして

from rdkit.Chem import rdFMCS

matrix = []
for mol1 in mols:
    for mol2 in mols:
        mcs = rdFMCS.FindMCS([mol1, mol2], atomCompare=rdFMCS.AtomCompare.CompareAny)
        a1 = len(mol1.GetAtoms())
        a2 = len(mol2.GetAtoms())
        matrix.append(mcs.numAtoms / (a1 + a2 - mcs.numAtoms) )

plt.hist(np.array(matrix).flatten(), bins=20)
plt.grid()
plt.show()

from rdkit.Chem import rdFMCS

matrix = []
for mol1 in mols:
    for mol2 in mols:
        mcs = rdFMCS.FindMCS([mol1, mol2])
        a1 = len(mol1.GetBonds())
        a2 = len(mol2.GetBonds())
        matrix.append(mcs.numBonds / (a1 + a2 - mcs.numBonds) )

plt.hist(np.array(matrix).flatten(), bins=20)
plt.grid()
plt.show()

from rdkit.Chem import rdFMCS

matrix = []
for mol1 in mols:
    for mol2 in mols:
        mcs = rdFMCS.FindMCS([mol1, mol2], bondCompare=rdFMCS.BondCompare.CompareOrderExact)
        a1 = len(mol1.GetBonds())
        a2 = len(mol2.GetBonds())
        matrix.append(mcs.numBonds / (a1 + a2 - mcs.numBonds) )

plt.hist(np.array(matrix).flatten(), bins=20)
plt.grid()
plt.show()

で、あなたのおっしゃる「タニモト係数」って、どれのことかしら？

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up