More than 1 year has passed since last update.

結晶構造のcifデータから非対称単位だけの原子座標を抽出してみたよ

Last updated at 2024-09-03Posted at 2024-08-30

結晶構造の非対称単位

結晶中では原子や分子は三次元的に規則正しく配列しています。この最小の繰り返し単位が結晶構造です。結晶構造はほとんどの場合(空間群が$P_1$以外)何かしらの対称性があります。結晶構造に含まれる原子や分子はその対称性によって関係づけることができます。ここで、結晶構造の対称性によって関係づけることのできない最小の要素を非対称単位と言います¹。アスピリンの有機結晶を例に見てみます²。結晶構造中にはアスピリンの分子が四つありますが、このうち一つのアスピリンの分子の座標が分かれば、他の三つの分子の座標は、結晶格子の対称性によって一義的に定まります。つまりアスピリンの有機結晶の非対称単位は一分子のアスピリンということになります。

非対称単位と結晶の対称性が分かれば結晶構造の全体がわかるので、国際的な結晶構造のデータのフォーマットであるcifでも原子座標は非対称単位のみ記載されています。有機結晶においては、結晶構造から一分子だけを取り出したい時が多々あります。したがって、アスピリンのような非対称単位に一分子だけが含まれる結晶では、cifデータから非対称単位の原子座標を取り出したいと思うわけです。

Pythonでcifデータを読み込む

Pythonでcifデータを読み込むのはpymatgenというモジュールが有名です。昨今、結晶構造を活用したデータサイエンスが流行っており、そこでpymatgenが活用されています。pymatgenを用いると次のようにしてcifファイルから結晶構造中の原子座標を取得できます。

from pymatgen.core import Structure
from pymatgen.io.cif import CifParser

# CIFファイルのパス
cif_file_path = "./1515581.cif"

# CIFファイルをパースして構造を取得
parser = CifParser(cif_file_path)
structure = parser.get_structures()[0]

# 原子の座標を抽出
atom_coords = structure.frac_coords

# 結果を表示
for i, coord in enumerate(atom_coords):
    print(f"Atom {i+1}: {coord}")

ただ、これには問題があり、非対称単位だけでなく結晶中の全部の原子の座標を取ってきてしまいます。自分の調べた限りではcifから非対称単位のみを簡単に取り出すモジュールは見つかりませんでした(もしご存知の方がいたらコメント頂けますと幸いです)。そこで、仕方がないのでcifデータから非対称単位だけの原子座標を抽出するプログラムを書きました。

cifデータから非対称単位だけの原子座標を抽出するプログラム

import re

# CIFファイルのパス
cif_file_path = "./1515581.cif"

with open(cif_file_path, "r") as f:
    lines = f.readlines()

def remove_parentheses(text):
    return re.sub(r'\(.*?\)', '', text)

def process_cif_posi(lines):
    # 各行の先頭のスペースと最後の\nを除去
    lines = [line.strip() for line in lines]
    lines = [line.lstrip(" ") for line in lines]

    # "loop_"の次の行が"_atom_site_label"であるかを判定
    for i, line in enumerate(lines):
        if line.startswith("loop_"):
            if lines[i + 1].startswith("_atom_site_label"):
                
                # "_atom_site_label" から数えて次の先頭が "_" でない行までの行数をカウント = 非対称単位の原子数
                count = 0
                for j in range(i + 1, len(lines)):
                    if not lines[j].startswith("_"):
                        break
                    count += 1
                
                # "_atom_site_type_symbol" が何行目にあるかカウント
                atom_site_type_symbol_index = lines.index("_atom_site_type_symbol", i + 1)
                atom_site_label_index = lines.index("_atom_site_label", i + 1)
                type_symbol_pos = atom_site_type_symbol_index - atom_site_label_index
                
                # "_atom_site_fract_x" が何行目にあるかカウント
                atom_site_fract_x_index = lines.index("_atom_site_fract_x", i + 1)
                fract_x_pos = atom_site_fract_x_index - atom_site_label_index
                
                # "_atom_site_fract_y" が何行目にあるかカウント
                atom_site_fract_y_index = lines.index("_atom_site_fract_y", i + 1)
                fract_y_pos = atom_site_fract_y_index - atom_site_label_index
                
                # "_atom_site_fract_z" が何行目にあるかカウント
                atom_site_fract_z_index = lines.index("_atom_site_fract_z", i + 1)
                fract_z_pos = atom_site_fract_z_index - atom_site_label_index
                
                # 先頭が"_"でない行を次の"_"が含まれる行が出る一行前まで取得してリスト化
                data_lines = []
                for k in range(j, len(lines)):
                    if lines[k].startswith("_"):
                        break
                    data_lines.append(lines[k].strip())
                    
                data_lines = data_lines[:-1]
                data_lines = [element for element in data_lines if element != ""]
                
                data_lines = [[data.split(" ")[0], data.split(" ")[type_symbol_pos], data.split(" ")[fract_x_pos], data.split(" ")[fract_y_pos], data.split(" ")[fract_z_pos]] for data in data_lines]
                
                data_lines = [[remove_parentheses(item) for item in row] for row in data_lines]

                break
        
    return data_lines

atom_posi = process_cif_posi(lines)
    
print(atom_posi)

これで出力されるatom_posiは次のような二次元のlistになってます。各値は[原子のラベル、元素記号、x座標、y座標、z座標]です。ここで注意したいことは、ここで得られるx座標、y座標、z座標は結晶の分率座標で、直交座標系の値ではありません。

[['O1', 'O', '0.62354', '0.1419', '0.61266'], 
['H1', 'H', '0.577', '0.025', '0.602'], 
['O2', 'O', '0.50970', '0.1879', '0.40933'], 
['O3', 'O', '0.78878', '0.41876', '0.72945'], 
['O4', 'O', '0.90669', '0.2189', '0.66325'], 
['C1', 'C', '0.65418', '0.4419', '0.50750'], 
['C2', 'C', '0.74806', '0.5203', '0.61181'], 
['C3', 'C', '0.80069', '0.7061', '0.60519'], 
['H3', 'H', '0.8632', '0.7580', '0.6776'], 
['C4', 'C', '0.76267', '0.8174', '0.49303'], 
['H4', 'H', '0.7997', '0.9449', '0.4882'], 
['C5', 'C', '0.67053', '0.7425', '0.38781'], 
['H5', 'H', '0.6444', '0.8182', '0.3106'], 
['C6', 'C', '0.61699', '0.5578', '0.39559'], 
['H6', 'H', '0.5534', '0.5081', '0.3233'], 
['C7', 'C', '0.59016', '0.2458', '0.50601'], 
['C8', 'C', '0.86823', '0.2623', '0.74362'], 
['C9', 'C', '0.89840', '0.1611', '0.86922'], 
['H7', 'H', '0.9545', '0.0480', '0.8770'], 
['H8', 'H', '0.8259', '0.1059', '0.8769'], 
['H9', 'H', '0.9347', '0.2625', '0.9366']]

非対称単位の座標が得られましたので、分子の形状を解析したり距離を測ったりを自動化できそうですね。以上です。

非対称単位の説明 ↩
Crystallography Open Database(COD)から無料でダウンロードできます。ファイル名は1515581.cifです。出典: Varughese, Sunil, et al. Chemical Science 2.11 (2011): 2236-2242. ↩

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up