
Hover chemical structure images over JupyterLab + matplotlib plots for easy exploratory data analysis

Posted at 2021-10-30, last updated at 2021-11-01

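Introduction

The Jupyter + matplotlib environment is very convenient for analyzing and visualizing compound data, but it has one drawback: there is no easy way to tell which compound a point on a plot represents or what its structure looks like.

Commercial software offers this feature as a matter of course, and I wanted it in the JupyterLab + matplotlib environment too. After some digging I found a way to do it, which I share here.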

Environment

python 3.6
matplotlib 3.2
jupyterlab 3.0.14
RDKit 2020.09.3
mordred 1.2
scikit-learn 0.24.2

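Goal

When the mouse hovers over a compound's point on the plot, the corresponding structure image and compound name should pop up.
In picture form, it looks something like this.

Implementation

The steps are explained in order, from loading the data to visualizing it.
First, import the required modules.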

import numpy as np
import pandas as pd
from rdkit import rdBase, Chem
from rdkit.Chem import AllChem, Descriptors, Draw
from rdkit.ML.Descriptors import MoleculeDescriptors
from mordred import Calculator, descriptors
from matplotlib.offsetbox import OffsetImage, AnnotationBbox, TextArea
import matplotlib.pyplot as plt

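Next, load the data and convert the SMILES strings to Mol objects.
This example uses a CSV file whose "Compound ID" column holds the compound names and whose "smiles" column holds the structures in SMILES format.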

df = pd.read_csv("./delaney-processed.csv")

# Convert SMILES to Mol objects
mols = []
for smiles in df["smiles"]:
    mol = Chem.MolFromSmiles(smiles)
    mols.append(mol)

To visualize the data, calculate descriptors and run a principal component analysis (PCA) on them.
First, calculate the descriptors with mordred.

# Calculate the mordred descriptors
mordred_calculator = Calculator(descriptors, ignore_3D=False)
df_mordred = mordred_calculator.pandas(pd.Series(mols, index=df.index))

For the PCA, convert the values to a numeric type, drop the columns that contain null values, and then autoscale (standardize) the result.

# Convert object columns to a numeric type
for column in df_mordred.columns:
    if df_mordred[column].dtype == object:
        df_mordred[column] = df_mordred[column].values.astype(np.float32)

# Drop the columns that still contain nulls after the conversion
df_mordred = df_mordred[df_mordred.columns[~df_mordred.isnull().any()]]

# Autoscaling (standardization)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
targets_scaling = ss.fit_transform(df_mordred.values)
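The conversion and null-column filtering above can be tried out on a small table before running it on the full descriptor set. The following is a minimal sketch, not the article's method: it swaps `astype` for `pd.to_numeric(errors="coerce")`, and the toy columns `a`, `b`, `c` are hypothetical stand-ins for descriptor columns, assuming that a failed descriptor shows up as a non-numeric entry.

```python
import pandas as pd

# Hypothetical toy descriptor table: column "b" holds a non-numeric entry,
# standing in for a descriptor that could not be calculated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.5, "failed", 1.5],
    "c": ["4", "5", "6"],
})

# Coerce every column to numbers; anything unparseable becomes NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Keep only the columns that contain no NaN at all.
clean = numeric.loc[:, ~numeric.isnull().any()]

print(list(clean.columns))
```

Column `b` is dropped because one of its entries became NaN, while the string digits in `c` are converted and kept.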

With the preparation done, run the PCA. This completes the data for the plot.

# Principal component analysis
from sklearn.decomposition import PCA
pca_model = PCA(n_components=2)
Z = pca_model.fit_transform(targets_scaling)
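To make clear what the autoscaling and the two-component PCA compute, here is a minimal sketch written with plain NumPy (an SVD of the standardized matrix) instead of scikit-learn. The random matrix `X` and the names `X_scaled`, `Z_toy` and `explained_ratio` are hypothetical and only serve the illustration; the signs of the components may differ from what scikit-learn returns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))              # hypothetical toy descriptor matrix
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]   # make two columns correlated

# Autoscaling: zero mean and unit variance per column (population std, ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the scaled matrix
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
Z_toy = X_scaled @ Vt[:2].T               # scores on the first two components

# Fraction of the total variance captured by each component
explained_ratio = S**2 / np.sum(S**2)
print(Z_toy.shape, explained_ratio[:2].sum())
```

The summed ratio of the first two components gives a rough idea of how much of the descriptor variance the 2D scatter plot actually shows.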

Next, prepare the chemical structure images to show on hover.

# Generate the chemical structure images (reusing the Mol objects created earlier)
images = []
for mol in mols:
    images.append(Draw.MolToImage(mol, size=(128, 128)))

Now for the visualization itself. The code is as follows; it is explained afterwards.

# Plot the graph
%matplotlib widget

# Create the figure and axes
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)

# Area that holds the image
imagebox = OffsetImage(images[0], zoom=1.0)
imagebox.image.axes = ax

# Annotation box that displays the image
annotationBoxImage = AnnotationBbox(imagebox, xy=(0, 0), xybox=(28, 28),
                        xycoords="data", boxcoords="offset points", pad=0.5,
                        arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=-0.3"))

annotationBoxImage.set_visible(False)
ax.add_artist(annotationBoxImage)

# Area that holds the text
textbox = TextArea("Test")

# Annotation box that displays the text
annotationBoxText = AnnotationBbox(textbox, xy=(0, 0), xybox=(0.1, 50), 
                    xycoords='data', boxcoords=("axes fraction", "data"),
                    box_alignment=(0., 0.5))

annotationBoxText.set_visible(False)
ax.add_artist(annotationBoxText)

sc = plt.scatter(Z[:, 0], Z[:, 1], c="r", alpha=0.7)

# Update the annotations for the hovered point
def update_annotation(index):
    i = index["ind"][0]
    pos = sc.get_offsets()[i]
    if pos[0] < 100:
        annotationBoxImage.xy = (pos[0] + 30, pos[1])
    else:
        annotationBoxImage.xy = (pos[0] - 50, pos[1])
    imagebox.set_data(images[i])
    textbox.set_text(df["Compound ID"].values[i])

    
# Callback invoked on every mouse move
def hover(event):
    visible = annotationBoxImage.get_visible()
    if event.inaxes == ax:
        contain, index = sc.contains(event)
        if contain:
            update_annotation(index)
            annotationBoxImage.set_visible(True)
            annotationBoxText.set_visible(True)
            fig.canvas.draw_idle()
        elif visible:
            annotationBoxImage.set_visible(False)
            annotationBoxText.set_visible(False)
            fig.canvas.draw_idle()

fig.canvas.mpl_connect("motion_notify_event", hover)
plt.show()

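Explanation

%matplotlib widget enables interactive operation in matplotlib (this backend is provided by the ipympl package). Be sure not to forget this line.
imagebox = OffsetImage(images[0], zoom=1.0) creates the image; the code below it passes the image to AnnotationBbox to create the annotation box that displays it.
Likewise, textbox = TextArea("Test") creates a text area; the code below it passes the text area to AnnotationBbox to create the annotation box that displays the compound name.
sc = plt.scatter(Z[:, 0], Z[:, 1], c="r", alpha=0.7) draws the scatter plot in the usual way. The returned PathCollection is kept in the variable sc because it is needed later to determine which point the mouse is over.
The function update_annotation(index) contains the processing for when a point is hovered. It updates the position and the image of the image annotation, and the compound name in the text area. Note that the position of the text area is fixed in this example. The if pos[0] < 100: branch is an adjustment: near the right edge the image would be cut off, so there it is displayed to the left of the point instead. Tune the numbers to match your own plot.
The function hover(event) is called whenever the mouse moves; it is registered with fig.canvas.mpl_connect("motion_notify_event", hover). It checks whether a point is being hovered and, if so, calls update_annotation with the index of that point.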
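The hard-coded threshold `100` and the shifts `30` and `50` in `update_annotation` depend on the data range of the plot. One way to avoid retuning them is to derive the shift from the current axis limits. The following is a minimal sketch under that assumption; the helper `image_anchor` and its parameters `shift_fraction` and `flip_at` are hypothetical and not part of the article's code.

```python
def image_anchor(x, y, xlim, shift_fraction=0.05, flip_at=0.8):
    """Return the (x, y) data position to assign to the image annotation.

    The point is shifted to the right by a fraction of the axis width,
    or to the left once it lies beyond `flip_at` of the way across the axis.
    """
    x_min, x_max = xlim
    width = x_max - x_min
    relative = (x - x_min) / width      # 0.0 at the left edge, 1.0 at the right
    shift = shift_fraction * width
    if relative < flip_at:
        return (x + shift, y)
    return (x - shift, y)


# Toy check with an axis running from 0 to 10
print(image_anchor(0.0, 1.0, (0.0, 10.0)))   # shifted to the right
print(image_anchor(9.0, 1.0, (0.0, 10.0)))   # shifted to the left
```

Inside `update_annotation`, the if/else branch could then be replaced with `annotationBoxImage.xy = image_anchor(pos[0], pos[1], ax.get_xlim())`.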

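Conclusion

I look forward to using this approach to dig through plenty of data and enjoy exploratory data analysis.

Addendum (2021/11/2): when the hover display does not appear

If the structure image or the compound name does not appear, one of the following may be the cause.

If the compound name does not appear, try shifting the xybox=(0.1, 50) value in annotationBoxText = AnnotationBbox(textbox, xy=(0, 0), xybox=(0.1, 50), ...). Setting it to (0, 0) first and adjusting from there works well.
If the structure image does not appear, the if pos[0] < 100: branch and the positions set inside it may not match the actual range of your plot. In this case too, first display the image without the branch, using annotationBoxImage.xy = (pos[0], pos[1]), and then adjust step by step.
The x and y coordinates of the cursor are shown below the plot, so use them as a guide when making the two adjustments above.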
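The compound name can go missing because the text box mixes coordinate systems: in `boxcoords=("axes fraction", "data")` the vertical position `50` is a data value, which may lie outside the visible y range. A minimal sketch of an alternative, assuming it is acceptable to pin the label to a corner of the axes, is to give both coordinates as axes fractions. The toy scatter data and the name `text_annotation` are hypothetical, and the `Agg` backend line is only there so that the sketch runs outside Jupyter (omit it in a notebook that uses `%matplotlib widget`).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this standalone sketch only
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnnotationBbox, TextArea

fig, ax = plt.subplots()
ax.scatter([0, 500, 1000], [-3, 0, 3])   # toy data with a small y range

textbox = TextArea("Test")

# Both xybox coordinates are axes fractions, so the box stays at the
# upper-left corner of the axes whatever the data range is.
# annotation_clip=False keeps it drawn even if xy lies outside the axes.
text_annotation = AnnotationBbox(textbox, xy=(0, 0), xybox=(0.05, 0.95),
                                 xycoords="data", boxcoords="axes fraction",
                                 box_alignment=(0.0, 1.0),
                                 annotation_clip=False)
ax.add_artist(text_annotation)
fig.canvas.draw()
```

With this placement the label position no longer needs tuning per dataset; only the image annotation follows the hovered point.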
