
Learning Pandas Through Chemoinformatics

Posted at 2020-01-19, last updated at 2020-04-01

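Introduction

Following "Learning NumPy Through Chemoinformatics", this article explains Pandas, one of Python's most widely used libraries, using lipidomics (the comprehensive analysis of lipids) as the subject matter.
The explanations focus on practical chemoinformatics examples, so if you want to review the basics first, read the following article before this one.

A Pharmaceutical Company Researcher's Summary of Pandas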

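Creating Series and DataFrames

Pandas makes calculations on tabular data easy to perform.

To use the library, first load it with import.
By convention, it is usually abbreviated as pd.

Pandas provides two main data structures: Series and DataFrame.

A Series is one-dimensional data, similar in structure to a list or a dictionary.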

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

print(numbers_carbon)
print(numbers_unsaturation)

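In the example above, think of index_fatty_acids as the names (labels) attached to the data values.
As shown below, a Series can also be created from a dictionary, with the entries of index_fatty_acids used as the dictionary keys.
However, this makes the code longer, so passing a list to index= is usually preferable.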

import pandas as pd


numbers_carbon = pd.Series({
    'FA 16:0': 16,
    'FA 16:1': 16,
    'FA 18:0': 18,
    'FA 18:1': 18,
    'FA 18:2': 18,
    'FA 18:3': 18,
    'FA 18:4': 18,
    'FA 20:0': 20,
    'FA 20:3': 20,
    'FA 20:4': 20,
    'FA 20:5': 20
})

print(numbers_carbon)
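
The two ways of building a Series can be compared directly. The following is a minimal sketch using a shortened label list; the names labels, from_list, and from_dict are illustrative and not part of the original article. It also shows why a Series resembles both a dictionary (lookup by label) and a list (lookup by position).

```python
import pandas as pd

labels = ['FA 16:0', 'FA 16:1', 'FA 18:0']  # shortened list, for illustration only

from_list = pd.Series([16, 16, 18], index=labels)
from_dict = pd.Series({'FA 16:0': 16, 'FA 16:1': 16, 'FA 18:0': 18})

# Both constructions give the same labels and values
print(from_list.equals(from_dict))

# Label-based lookup, like a dictionary
print(from_list['FA 18:0'])

# Position-based lookup, like a list
print(from_list.iloc[0])
```

Using .iloc for positional lookup keeps the intent explicit when the index consists of labels rather than integers.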

A DataFrame, on the other hand, is two-dimensional data built by combining Series.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids)

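In the example above, the keys of the dictionary passed to pd.DataFrame become the column names of the table.
The index specified earlier when creating each Series becomes the row names.
Incidentally, the df in df_fatty_acids stands for "dataframe".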
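
As a variation, a DataFrame can also be built without creating the intermediate Series, by passing a dictionary of plain lists together with index=. This is only a sketch with a shortened table; df_short is a hypothetical name used for illustration.

```python
import pandas as pd

index_fatty_acids = ['FA 16:0', 'FA 18:1', 'FA 20:4']  # shortened, for illustration only

# Dictionary keys become column names; index= supplies the row names
df_short = pd.DataFrame(
    {'Cn': [16, 18, 20], 'Un': [0, 1, 4]},
    index=index_fatty_acids,
)

print(df_short)
print(df_short.shape)  # (number of rows, number of columns)
```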

Accessing Data

To get the row names and column names of a DataFrame, use index and columns, respectively.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids.index) # row names
print(df_fatty_acids.columns) # column names

To access specific elements of a DataFrame, write the following.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

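print(df_fatty_acids['Cn']) # select a column by name
print(df_fatty_acids.Cn) # select a column by attribute access (only works when the name is a valid Python identifier)

print(df_fatty_acids['Cn'].iloc[0]) # select a column by name, then a row by position

print(df_fatty_acids[2:5]) # slice rows by position (row number)
print(df_fatty_acids[5:]) # rows from the given position onward
print(df_fatty_acids[:5]) # rows before the given position
print(df_fatty_acids[-5:]) # positions counted from the end
print(df_fatty_acids[2:5]['Cn']) # slice rows by position, then select a column by name

print(df_fatty_acids.loc['FA 16:0', 'Cn']) # specify row name and column name
print(df_fatty_acids.loc['FA 16:0']) # specify row name
print(df_fatty_acids.loc[:, 'Cn']) # specify column name

print(df_fatty_acids.iloc[0, 0]) # specify row number and column number
print(df_fatty_acids.iloc[0]) # specify row number
print(df_fatty_acids.iloc[:, 0]) # specify column number
print(df_fatty_acids.iloc[-1, -1]) # element in the last row and last column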

Whether it is better to specify row and column names (loc) or row and column numbers (iloc) depends on the situation, so choose whichever is easier in each case.
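
One practical difference worth keeping in mind when choosing between the two is how slices behave. The sketch below, with a shortened table and illustrative variable names, shows that a label slice with loc includes the end label, while a positional slice with iloc excludes the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 16, 18, 18], 'Un': [0, 1, 0, 1]},
    index=['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1'],
)

# Label slice: both 'FA 16:1' and 'FA 18:0' are included
by_label = df.loc['FA 16:1':'FA 18:0']
print(by_label)

# Positional slice: position 1 is included, position 2 is excluded
by_position = df.iloc[1:2]
print(by_position)
```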

You can also extract only the data that satisfy a given condition.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

print(df_fatty_acids[df_fatty_acids['Cn'] >= 18]) # DataFrame of the rows that satisfy the condition
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18]['Cn']) # filter rows by the condition, then select a column by name
print(df_fatty_acids[df_fatty_acids['Cn'] >= 18].iloc[:, 0]) # filter rows by the condition, then select a column by number

print(df_fatty_acids[(df_fatty_acids['Cn'] >= 18) & (df_fatty_acids['Un'] >= 2)]) # multiple conditions (and)
print(df_fatty_acids[(df_fatty_acids['Cn'] >= 18) | (df_fatty_acids['Un'] >= 1)]) # multiple conditions (or)

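When specifying multiple conditions, each condition must be wrapped in parentheses (), because & and | bind more tightly than comparison operators such as >=. Do not forget them.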
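
As an alternative way of writing combined conditions, DataFrame.query accepts the condition as a string, where and/or can be used without the extra parentheses. The following sketch uses a shortened table and illustrative variable names, and checks that both styles select the same rows.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 16, 18, 18, 20], 'Un': [0, 1, 0, 2, 4]},
    index=['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:2', 'FA 20:4'],
)

# Boolean-mask style: each comparison needs its own parentheses
mask = (df['Cn'] >= 18) & (df['Un'] >= 2)
selected = df[mask]

# query style: the same condition written as a string expression
selected_query = df.query('Cn >= 18 and Un >= 2')

print(selected)
print(selected.equals(selected_query))
```

The query form only works this directly because Cn and Un are valid Python identifiers; column names containing spaces need backtick quoting inside the query string.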

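Adding Data

DataFrame_name['column_name'] selects a specific column, but assigning to a column name that does not exist in the DataFrame creates a new column.
Calculations based on the data in other columns are also easy to write.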

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

print(df_fatty_acids)

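In the example above, the number of carbon atoms C and the number of hydrogen atoms H of each fatty acid molecular species are calculated from the values in the Cn and Un columns. A saturated fatty acid has the formula CnH2nO2, and each double bond removes two hydrogen atoms, so H = 2 × Cn − 2 × Un. The number of oxygen atoms O is fixed at 2 for the single carboxyl group.
Printing df_fatty_acids shows that the new columns C, H, and O have been added.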
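
If the original table should stay unchanged, the same columns can be added with DataFrame.assign, which returns a new DataFrame. This is a sketch with three well-known fatty acids (palmitic acid C16H32O2, oleic acid C18H34O2, arachidonic acid C20H32O2) as a sanity check; df and df_atoms are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'Cn': [16, 18, 20], 'Un': [0, 1, 4]},
    index=['FA 16:0', 'FA 18:1', 'FA 20:4'],
)

# assign returns a copy with the extra columns; df itself is not modified
df_atoms = df.assign(
    C=df['Cn'],
    H=df['Cn'] * 2 - df['Un'] * 2,
    O=2,
)

print(df_atoms)
print(list(df.columns))  # still only the original columns
```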

Arithmetic

Next, consider calculating the exact mass of each fatty acid molecular species.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

df_fatty_acids['Exact mass'] = 0.0 # start every row at 0.0 before accumulating

exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

for atom in exact_masses.index:
    df_fatty_acids['Exact mass'] += exact_masses[atom] * df_fatty_acids[atom] # add this element's contribution to the exact mass

print(df_fatty_acids)

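In the example above, the exact mass of each fatty acid molecule is obtained by multiplying the exact mass of each atom by the number of atoms and summing the results.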
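
The same weighted sum can also be written without an explicit loop. The sketch below assumes a small table of atom counts (df_atoms is an illustrative name) and uses DataFrame.dot, which multiplies each column by the matching entry of exact_masses and sums across each row.

```python
import pandas as pd

# Atom counts for palmitic acid (FA 16:0) and oleic acid (FA 18:1)
df_atoms = pd.DataFrame(
    {'C': [16, 18], 'H': [32, 34], 'O': [2, 2]},
    index=['FA 16:0', 'FA 18:1'],
)

exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

# Select the columns in the same order as exact_masses, then take the row-wise dot product
exact_mass = df_atoms[list(exact_masses.index)].dot(exact_masses)

print(exact_mass.round(4))
```

For FA 16:0 this gives 16 × 12 + 32 × 1.00783 + 2 × 15.99491 = 256.24038, which is the value the loop version produces as well.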

String Concatenation

Next, let's derive the molecular formula.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

df_fatty_acids['Molecular formula'] = '' # start every row with an empty string

exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

for atom in exact_masses.index:
    df_fatty_acids['Molecular formula'] += atom + df_fatty_acids[atom].astype(str) # append the element symbol and its atom count

print(df_fatty_acids)

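The idea is simply to concatenate each element symbol with its atom count as strings, but the data in the C, H, and O columns are numeric, so they must be converted to strings before concatenation.
That is why the example above uses astype(str) to convert the numbers to strings before joining them.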
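
For a fixed set of elements, the formula can also be written as a single expression, or row by row with an f-string. The following sketch assumes a small table of atom counts; df_atoms, formula_concat, and formula_apply are illustrative names.

```python
import pandas as pd

df_atoms = pd.DataFrame(
    {'C': [16, 18], 'H': [32, 34], 'O': [2, 2]},
    index=['FA 16:0', 'FA 18:1'],
)

# Vectorized concatenation: convert each numeric column to str first
formula_concat = (
    'C' + df_atoms['C'].astype(str)
    + 'H' + df_atoms['H'].astype(str)
    + 'O' + df_atoms['O'].astype(str)
)

# Row-wise alternative: the f-string formats each number as text
formula_apply = df_atoms.apply(
    lambda row: f"C{row['C']}H{row['H']}O{row['O']}", axis=1
)

print(formula_concat)
print(formula_concat.equals(formula_apply))
```

The vectorized form is usually the faster of the two on large tables, while the apply form can be easier to read when the formatting rules get more involved.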

Writing to External Files

Next, consider writing the finished data to an external file.

import pandas as pd


index_fatty_acids = ['FA 16:0', 'FA 16:1', 'FA 18:0', 'FA 18:1', 'FA 18:2', 'FA 18:3', 'FA 18:4', 'FA 20:0', 'FA 20:3', 'FA 20:4', 'FA 20:5']

numbers_carbon = pd.Series([16, 16, 18, 18, 18, 18, 18, 20, 20, 20, 20], index=index_fatty_acids)
numbers_unsaturation = pd.Series([0, 1, 0, 1, 2, 3, 4, 0, 3, 4, 5], index=index_fatty_acids)

df_fatty_acids = pd.DataFrame({'Cn': numbers_carbon, 'Un': numbers_unsaturation})

df_fatty_acids['C'] = df_fatty_acids['Cn']
df_fatty_acids['H'] = df_fatty_acids['Cn'] * 2 - df_fatty_acids['Un'] * 2
df_fatty_acids['O'] = 2

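exact_masses = pd.Series({'C': 12, 'H': 1.00783, 'O': 15.99491})

# Exact mass: multiply each atom count by its atomic mass, then sum across each row
df_fatty_acids['Exact mass'] = (df_fatty_acids[list(exact_masses.index)] * exact_masses).sum(axis=1)

df_fatty_acids['Molecular formula'] = '' # create the column before appending to it
for atom in exact_masses.index:
    df_fatty_acids['Molecular formula'] += atom + df_fatty_acids[atom].astype(str) # molecular formula

df_fatty_acids.to_csv('fatty_acids.csv') # write a CSV file
df_fatty_acids.to_csv('fatty_acids.txt', sep='\t') # write a tab-delimited text file
df_fatty_acids.to_excel('fatty_acids.xlsx', sheet_name='fatty_acids') # write an Excel file (needs an Excel writer engine such as openpyxl)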

Reading External Files

Conversely, to read an external file, do the following.

import pandas as pd


df_csv = pd.read_csv('fatty_acids.csv', index_col=0) # read the CSV file
df_text = pd.read_csv('fatty_acids.txt', sep='\t', index_col=0) # read the tab-delimited text file
df_excel = pd.read_excel('fatty_acids.xlsx', index_col=0) # read the Excel file

print(df_csv)
print(df_text)
print(df_excel)
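
To see what index_col=0 does, the round trip can be tried in memory. This sketch writes a small table to an io.StringIO buffer instead of a file on disk (an assumption made here so that the example needs no files); df_out, df_in, and df_raw are illustrative names.

```python
import io

import pandas as pd

df_out = pd.DataFrame(
    {'Cn': [16, 18], 'Un': [0, 1]},
    index=['FA 16:0', 'FA 18:1'],
)

# Write CSV text into an in-memory buffer
buffer = io.StringIO()
df_out.to_csv(buffer)
csv_text = buffer.getvalue()

# With index_col=0, the first column is restored as the row names
df_in = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_in)

# Without index_col, the row names come back as an ordinary column
df_raw = pd.read_csv(io.StringIO(csv_text))
print(df_raw.columns.tolist())
```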

To display only the first or last few rows of a DataFrame, do the following.

import pandas as pd


df_csv = pd.read_csv('fatty_acids.csv', index_col=0) # read the CSV file
df_text = pd.read_csv('fatty_acids.txt', sep='\t', index_col=0) # read the tab-delimited text file
df_excel = pd.read_excel('fatty_acids.xlsx', index_col=0) # read the Excel file

print(df_csv.head()) # show the first 5 rows
print(df_csv.head(3)) # show the first 3 rows
print(df_csv.tail()) # show the last 5 rows

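head returns the specified number of rows from the start of the data, and tail returns the specified number of rows from the end.
If no number is given, 5 rows are returned by default.
The same applies to df_text and df_excel.

As shown above, the basic flow of data analysis is to read an external file into a DataFrame, check how the data are stored, extract specific rows and columns or add new columns with calculated values, and then write out the finished table.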
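
The whole flow can be sketched end to end in a few lines. The example below reads from CSV text held in a string rather than from a file, which is an assumption made here to keep it self-contained; csv_text, df_unsat, and result_csv are illustrative names.

```python
import io

import pandas as pd

csv_text = """,Cn,Un
FA 16:0,16,0
FA 18:1,18,1
FA 18:2,18,2
FA 20:4,20,4
"""

# 1. Read the data into a DataFrame
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# 2. Check how the data are stored
print(df.shape)
print(df.dtypes)
print(df.head(2))

# 3. Extract the rows of interest (unsaturated species); copy before modifying
df_unsat = df[df['Un'] >= 1].copy()

# 4. Add a calculated column
df_unsat['H'] = df_unsat['Cn'] * 2 - df_unsat['Un'] * 2

# 5. Write out the result (to_csv returns a string when no path is given)
result_csv = df_unsat.to_csv()
print(result_csv)
```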

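Summary

This article covered Pandas, focusing on practical knowledge that can be used in chemoinformatics.
Let's review the key points once more.

Pandas provides two data structures, Series and DataFrame.
It supports database-like operations such as extracting specific rows and columns or selecting only the data that satisfy a condition.
It can also read from and write to external files.

Matplotlib is covered next, in the following article.

Learning Matplotlib Through Chemoinformatics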

