アンモニウムイオンを含むかどうかの判定アルゴリズム作成し，複数のPOSCARファイルに適用（ログ）

Last updated at 2024-01-09Posted at 2023-12-25

目標

POSCARファイルから，アンモニウムイオンを含むどうかを判定するアルゴリズムを作成する．

前準備

0-1．元素種N, Hを含むPOSCARファイルから，POSCAR.nnlistを作成し，NとHの結合距離の分布図を描く．
0-2．0-1．よりアンモニウムイオンのNH間の結合距離の最大値を推定する．（注）

結合探索アルゴリズム

1．POSCARファイルが元素種N, Hを含む
2．POSCAR.nnlistにおいて，原子Nから0-2．で推定したNH結合距離以内に，原子Nに対するNeigborsListに原子Hを4つ以上含む，中心原子Nが存在する．

####### 条件１：原子Nから原子Hに対して，結合手がちょうど4本生えているか？ #######
3．2．の中心原子Nに対して，
1番近い原子がHであり，かつ2番目・3番目・4番目に近い原子もHである，中心原子Nが存在する．
4．2．の中心原子Nに対して，5番目に近い原子が
存在しない場合 → 条件１をクリア．
存在する場合 → 5．に進む．
5．2．の中心原子Nに対して，5番目に近い原子とHとの距離が，4番目に近い原子（：酸素H ∵手順3．）とのNH距離より大きい．
→ 条件1をクリア．

####### 条件2：N周りの4つのHすべてが，NH結合距離内において，中心のN以外と結合していないか？ #######
6．3．の4つの原子H全てに対して，3．の中心の原子Nとの距離以内に，中心原子N以外の別の原子が存在しない．
→ 条件2をクリア．

→ 「条件1かつ条件2を満たす．⇔ アンモニウムイオンを含む」．？

前準備ためのプログラムの実行（流れ）

0-1-1．cif/ディレクトリ内の，元素種N, Hを含むPOSCARファイルパスをリストで取得．

元素種C，Oを含むPOSCARファイルのパスのみ抽出（ログ） #Python3 - Qiita
https://qiita.com/k-morii-toridai/items/1583fb78e1fde205cebc

0-1-2．0-1-1．で得たPOSCAR.nnlistから，原子Cと原子Oの距離を抽出し，NumPyのndarrayに格納．
（注）POSCAR.nnlistはPandasのDataFrame化して処理する．
0-1-3．0-1-2．で得たndarrayを1次元にflattenし，ヒストグラムを描画．

0-2．0-1-3．よりアンモニウムイオンのNH間の結合距離の最大値を推定する．
参考として，アンモニウムイオンの熱化学半径は1.51Åと書かれている．
一方でヒストグラムから，0.9Å付近に孤立した山形状の分布が見られる．距離が近い原子同士が結合を生成いることから，ここに窒素水素間結合距離に対応すると考えられる．
炭素酸素間結合距離の閾値として，1.2Åとするのが妥当だと考えられる．
これらより，目安となる炭素酸素間結合距離の最大値を1.65Åと仮定して進める．

アルゴリズムの実装

import numpy as np


def filter_2(df_nnlist):
    """
    2．POSCAR.nnlistにおいて，原子Nから0-2．のNH結合距離以内に，原子Hを4つ以上含む，中心原子Nが存在するかどうか判定．
    → 存在する場合、True値，{中心原子Cの'central_atom_id': そのneighborsの'central_atom_id'}の辞書の2つを返す．
    → 存在しない場合，False値，空の辞書を返す．

    Usage:
    ------
    bool_2, dict_2 = filter_2(df_nnlist=df_nnlist)

    Parameters:
    -----------
    df_nnlist: pd.DataFrame

    Returns:
    --------
    bool_2: bool
    dict_2: dict
    """
    df_nnlist_group_dict = df_nnlist[df_nnlist['central_atom_symbol'] == 'N'].groupby('central_atom_id').groups
    df_nnlist_central_atom_ids = np.array(list(df_nnlist_group_dict.keys()))
    bool_list = []
    for key in df_nnlist_central_atom_ids:
        bool_sort_dist_list = df_nnlist.iloc[df_nnlist_group_dict[key]].sort_values('rel_distance')['rel_distance'] < 1.2
        bool_list.append(df_nnlist.iloc[df_nnlist_group_dict[key]].sort_values('rel_distance')[bool_sort_dist_list]['neighboring_atom_symbol'].tolist().count('H') >= 4)
    df_nnlist_central_atom_ids_fillterd = df_nnlist_central_atom_ids[bool_list]
    bool_filter_2 = len(df_nnlist_central_atom_ids_fillterd) >= 1
    filtered_df_nnlist_group_dict = {key: df_nnlist_group_dict[key] for key in df_nnlist_group_dict.keys() if key in df_nnlist_central_atom_ids_fillterd}

    return bool_filter_2, filtered_df_nnlist_group_dict


def filter_3(df_nnlist, dict_2):
    """
    3．2．の中心原子Nに対して，1番近い原子がHであり，かつ2番目・3番目・4番目に近い原子もHである，中心原子Nが存在するかどうか判定．
    → 存在する場合，True値，{中心原子Cの'central_atom_id': そのneighborsの'central_atom_id'}の辞書の2つを返す．
    → 存在しない場合，False値，空の辞書を返す．

    Usage:
    ------
    bool_3, dict_3 = filter_3(df_nnlist=df_nnlist, dict_2=dict_2)

    Parameters:
    -----------
    df_nnlist: pd.DataFrame
    dict_2: dict

    Returns:
    --------
    bool_3: bool
    dict_3: dict
    """
    bool_list_3 = []
    for k, v in dict_2.items():
        bool_list_3.append(set(df_nnlist.iloc[dict_2[k]].sort_values(by='rel_distance')['neighboring_atom_symbol'].tolist()[1:5]) == {'H'})
    df_nnlist_central_atom_ids_fillterd_3 = np.array(list(dict_2.keys()))[bool_list_3]
    filtered_3_df_nnlist_group_dict = {key: dict_2[key] for key in dict_2.keys() if key in df_nnlist_central_atom_ids_fillterd_3}
    bool_filter_3 = len(df_nnlist_central_atom_ids_fillterd_3) >= 1

    return bool_filter_3, filtered_3_df_nnlist_group_dict


def filter_4(df_nnlist, dict_3):
    """
    4．2．の中心原子Nに対して5番目に近い原子が存在しないかを判定．
    → 存在しない場合，True値，{中心原子Nの'central_atom_id': そのneighborsの'central_atom_id'}の辞書の2つを返す．
    → 存在する場合，False値，空の辞書の2つを返す．

    Usage:
    ------
    bool_4, dict_4 = filter_4(df_nnlist=df_nnlist, dict_3=dict_3)

    Parameters:
    -----------
    df_nnlist: pd.DataFrame
    dict_3: dict

    Returns:
    --------
    bool_4: bool
    dict_4: dict
    """
    bool_list_4 = []
    for k, v in dict_3.items():
        bool_list_4.append(len(df_nnlist.iloc[dict_3[k]].sort_values(by='rel_distance')['neighboring_atom_symbol'].tolist()) == 5)
    df_nnlist_central_atom_ids_fillterd_4 = np.array(list(dict_3.keys()))[bool_list_4]
    filtered_4_df_nnlist_group_dict = {key: dict_3[key] for key in dict_3.keys() if key in df_nnlist_central_atom_ids_fillterd_4}
    bool_filter_4 = len(df_nnlist_central_atom_ids_fillterd_4) >= 1

    return bool_filter_4, filtered_4_df_nnlist_group_dict


def filter_5(df_nnlist, dict_3):
    """
    5．2．の中心原子Nに対して5番目に近い原子が，Nに4番目に近い原子HとNのNH距離より大きいNが存在するどうかを判定．
    → 存在する場合，True値，{中心原子Cの'central_atom_id': そのneighborsの'central_atom_id'}の辞書の2つを返す．
    → 存在しない場合，False値，空の辞書を返す．

    Usage:
    ------
    bool_5, dict_5 = filter_5(df_nnlist=df_nnlist, dict_3=dict_3)

    Parameters:
    -----------
    df_nnlist: pd.DataFrame
    dict_3: dict

    Returns:
    --------
    bool_5: bool
    dict_5: dict
    """
    bool_list_5 = []
    for k, v in dict_3.items():
        fourth_NH_bond_dist = df_nnlist.iloc[dict_3[k]].sort_values(by='rel_distance')['rel_distance'].tolist()[4]
        fifth_dist = df_nnlist.iloc[dict_3[k]].sort_values(by='rel_distance')['rel_distance'].tolist()[5]
        bool_list_5.append(fourth_NH_bond_dist < fifth_dist)
    df_nnlist_central_atom_ids_fillterd_5 = np.array(list(dict_3.keys()))[bool_list_5]
    filtered_5_df_nnlist_group_dict = {key: dict_3[key] for key in dict_3.keys() if key in df_nnlist_central_atom_ids_fillterd_5}
    bool_filter_5 = len(df_nnlist_central_atom_ids_fillterd_5) >= 1

    return bool_filter_5, filtered_5_df_nnlist_group_dict


def filter_6(df_nnlist, dict_3):
    """
    6．3．の4つの原子H全てに対して，3．の中心の原子Nとの距離以内に，中心原子N以外の別の原子が存在しないかどうかを判定．
    → 存在しない場合，True値，中心原子Cの'central_atom_id'のndarrayの2つを返す．
    → 存在する場合，False値，空のndarrayを返す．

    Usage:
    ------
    bool_6, N_ids = filter_6(df_nnlist=df_nnlist, dict_3=dict_3)

    Parameters:
    -----------
    df_nnlist: pd.DataFrame
    dict_3: dict

    Returns:
    --------
    bool_6: bool
    N_ids: ndarray
    """
    bool_list_6 = []
    for k, v in dict_3.items():
        # N周りのH4つのindex
        indices = df_nnlist.iloc[dict_3[k]].sort_values(by='rel_distance').index[1:5]
        H_ids = df_nnlist.iloc[indices].apply(lambda row: row['neighboring_atom_id'], axis=1).tolist()
        bool_list_temp = []
        for H_id in H_ids:
            bool_temp = df_nnlist[df_nnlist['central_atom_id'] == H_id].sort_values('rel_distance')['neighboring_atom_symbol'].tolist()[1] == 'N'
            bool_list_temp.append(bool_temp)
        if set(bool_list_temp) == {True}:
            bool_list_6.append(True)
        else:
            bool_list_6.append(False)
    H_ids = np.array(list(dict_3.keys()))[bool_list_6]
    bool_filter_6 = bool_list_6.count(True) >= 1

    return bool_filter_6, H_ids


def concat_filter(df_nnlist):
    """
    filter_2()~filter_6()の関数を用いて，POSCAR.nnlistを用いて，POSCARファイルに炭酸イオンを含むかどうかの判定algolismを作成．
    → True値が返された場合，炭酸イオンを含む．
    → False値が返され場合，炭酸イオンを含まない．

    Usage:
    ------
    concat_filter(df_nnlist=df_nnlist)

    Parameters:
    -----------
    df_nnlist: pd.DataFrame

    Returns:
    --------
    bool: True or False
    """
    bool_2, dict_2 = filter_2(df_nnlist=df_nnlist)
    if bool_2:
        bool_3, dict_3 = filter_3(df_nnlist=df_nnlist, dict_2=dict_2)
        if bool_3:
            bool_4, dict_4 = filter_4(df_nnlist=df_nnlist, dict_3=dict_3)
            if bool_4:
                return True
            else:
                bool_5, dict_5 = filter_5(df_nnlist=df_nnlist, dict_3=dict_3)
                if bool_5:
                    bool_6, N_ids = filter_6(df_nnlist=df_nnlist, dict_3=dict_3)
                    if bool_6:
                        return True
                    else:
                        return False
                else:
                    return False
        else:
            return False
    else:
        return False


if __name__ == '__main__':
    from package_file_conversion.nnlist2df import nnlist2df
    nnlist_path = 'sample_test_files/1000033/nnlist_5/POSCAR.nnlist'
    df_nnlist = nnlist2df(nnlist_path=nnlist_path)
    print(f"concat_filter(df_nnlist=df_nnlist): {concat_filter(df_nnlist=df_nnlist)}")

プログラムの実行（流れ）

1．cif/ディレクトリ内の，元素種N, Hを含むPOSCARファイルパスをリストで取得．

pwd
/mnt/ssd_elecom_c2c_960gb/scripts/get_some_speceis_existed_poscar_path_list

time python3 get_some_speceis_existed_poscar_path_list.py N H
ls -CF

len(poscar_abs_path_list_loaded): 308325
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 308325/308325 [11:41<00:00, 439.55it/s]
len(N_H_existed_poscar_file_path_list): 210889
N_H_existed-poscar file, and folder path list were saved as .npy!!!

real 11m54.054s
user 3m47.663s
sys 8m13.081s

ls -CF
C_O_existed_poscar_file_path_list.npy*    N_H_existed_poscar_folder_path_list.npy*       get_some_speceis_existed_poscar_path_list.py.sh*
C_O_existed_poscar_folder_path_list.npy*  get_poscar_existed_path_list.py*               poscar_existed_file_path_list.npy*
N_H_existed_poscar_file_path_list.npy*    get_some_speceis_existed_poscar_path_list.py*  poscar_existed_folder_path_list.npy*

pwd
/mnt/ssd_elecom_c2c_960gb/scripts/get_some_ion_contained_pos_folder_p_list

ls -CF
CO3_contained_poscar_folder_path_list.npy*                             get_NH4_contained_pos_folder_p_list.py*
CO3_contained_poscar_folder_path_list_minus_filter_6.npy*  get_CO3_contained_pos_folder_p_list.py*                 package_bond_search_algorithm/
NH4_contained_poscar_folder_path_list.npy*                 get_CO3_contained_pos_folder_p_list_minus_filter_6.py*

time python3 get_NH4_contained_pos_folder_p_list.py

len(N_H_existed_poscar_folder_path_list): 210889
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 210889/210889 [43:13<00:00, 81.31it/s]
len(ion_contained_poscar_folder_path_list)/len(N_H_existed_poscar_folder_path_list) :799/210889

real 43m19.905s
user 174m53.191s
sys 10m58.492s

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up