More than 3 years have passed since last update.

Pandas で○○したい

Last updated at 2021-01-18Posted at 2020-07-06

更新履歴

2020/07/08
データの表示形式を CSV から Markdown Table に変更しました。
CSV から Markdown Table への変換には CSV to Markdown Table Generator を使用しました。
2021/01/18
欠損値を補完する方法を追記しました。

Pandas で「あれ、〇〇したいときどうすれば良いんだっけ？」となることが多いので、用途別にまとめます。

前提

今回のサンプルコードでは、
Kaggle社が提供しているタイタニック号の生存者リスト(train.csv)を
pandas.read_csv() で読み込んで使用します。

Titanic: Machine Learning from Disaster | Kaggle

import pandas as pd
df = pd.read_csv('train.csv')

pandas.read_csv — pandas 1.0.5 documentation

統計情報を出力したい

df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

# 出力するカラムを絞る
df['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

pandas.DataFrame.describe — pandas 1.0.5 documentation

データの件数を調べたい

df['Age'].count()

None、NaN、NaT 以外の値が含まれている行数/列数を調べることができる。

pandas.DataFrame.count — pandas 1.0.5 documentation

データを絞り込みたい

# 20 < Age < 40 の行を取り出す
df[(20 < df['Age']) & (df['Age'] < 40)].head()

Index	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

複数のAND/OR条件で絞り込みたい場合は、df[(A) & (B)]のように、条件を () で囲んで指定する。

Python Pandas: Boolean indexing on multiple columns - Stack Overflow

カテゴリー化データを数値に変換したい

# Embarked(C, Q, S)を数値(1, 2, 3)に変換
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})

Index	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	3.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	1.0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	3.0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	3.0

pandas.Series.map — pandas 1.0.4 documentation

列名を変更したい

# Sex(female, male)を数値(0, 1)に変換し、列名(Sex)をMaleに変更
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	NaN	3.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	C85	1.0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	C123	3.0
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	NaN	3.0

pandas.DataFrame.rename — pandas 1.0.4 documentation

列名の一覧を持つ配列を渡せば、全ての列名を一括で変更することもできる。

pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']

Index	a	b
0	1	10
1	2	20

python - Renaming columns in pandas - Stack Overflow

欠損値の数を列ごとに確認したい

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation

欠損値を除外したい

# 欠損値を含む行全てを除外
df_dn = df.dropna()
df_dn.count()

PassengerId    183
Survived       183
Pclass         183
Name           183
Male           183
Age            183
SibSp          183
Parch          183
Ticket         183
Fare           183
Cabin          183
Embarked       183
dtype: int64

pandas.DataFrame.dropna — pandas 1.0.5 documentation

欠損値を補完したい

# 列 Age の欠損値を Age の中央値で補完
df['Age'] = df['Age'].fillna(df['Age'].median())
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

pandas.DataFrame.fillna — pandas 1.2.0 documentation

指定した列を取り出したい

# Survived と Age 列を取り出す
df[['Survived', 'Age']]

Index	Survived	Age
0	0	22.0
1	1	38.0
2	1	26.0
3	1	35.0
4	0	35.0

Indexing and selecting data — pandas 1.0.4 documentation
pandasで任意の位置の値を取得・変更するat, iat, loc, iloc | note.nkmk.me

指定した列を除外したい

df_dn = df.drop('Cabin', axis='columns')

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	3.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	1.0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	3.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	3.0
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	3.0

pandas.DataFrame.dropna — pandas 1.0.5 documentation

行/列の値に関数を適用し、新しい行/列を作りたい

import re

# 敬称を抽出する関数
def getTitle(row):
    name = row['Name']
    p = re.compile('.*\ (.*)\.\ .*')
    surname = p.search(name)
    return surname.group(1)

df['Title'] = df.apply(getTitle, axis='columns')

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	NaN	3.0	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	C85	1.0	Mrs
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	C123	3.0	Mrs
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	NaN	3.0	Mr

pandas.DataFrame.apply — pandas 1.0.5 documentation

カテゴリごとの平均を計算したい

# 敬称ごとの平均年齢を求める
df.groupby('Title').mean()['Age']

Title
Capt        70.000000
Col         58.000000
Countess    33.000000
Don         40.000000
Dr          42.000000
Jonkheer    38.000000
L           54.000000
Lady        48.000000
Major       48.500000
Master       4.574167
Miss        21.773973
Mlle        24.000000
Mme         24.000000
Mr          32.368090
Mrs         35.728972
Ms          28.000000
Rev         43.166667
Sir         49.000000
Name: Age, dtype: float64

df.groupby('Title').count() とすると、敬称ごとのデータ件数を求めることも出来る。
Pandas の groupby の使い方 - Qiita

値で行をソートしたい

df.sort_values(by='Age')

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title	AgeMean
803	804	1	3	Thomas, Master. Assad Alexander	1	0.42	0	1	2625	8.5167	NaN	1.0	Master	NaN
755	756	1	2	Hamalainen, Master. Viljo	1	0.67	1	1	250649	14.5000	NaN	3.0	Master	NaN
644	645	1	3	Baclini, Miss. Eugenie	0	0.75	2	1	2666	19.2583	NaN	1.0	Miss	NaN
469	470	1	3	Baclini, Miss. Helene Barbara	0	0.75	2	1	2666	19.2583	NaN	1.0	Miss	NaN
78	79	1	2	Caldwell, Master. Alden Gates	1	0.83	0	2	248738	29.0000	NaN	3.0	Master	NaN

pandas.DataFrame.sort_values — pandas 1.0.5 documentation

通常 sort_values()を実行した DaraFrame は変更されず、返り値がソートされた状態で得られる。
ascending=Falseを指定すると、指定した列の降順でソートされる。
inplace=Trueを指定すると、sort_values()を実行した DataFrame がソートされ、返り値は None になる。

列に含まれているユニークな値を調べたい

df['Survived'].unique()

array([0, 1], dtype=int64)

pandas.unique — pandas 1.0.5 documentation

特定の文字列を含む行を取り出したい

df[df['Name'].str.contains('Thomas')]

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title	AgeMean
149	150	0	2	Byles, Rev. Thomas Roussel Davids	1	42.00	0	0	244310	13.0000	NaN	3.0	Rev	NaN
151	152	1	1	Pears, Mrs. Thomas (Edith Wearne)	0	22.00	1	0	113776	66.6000	C2	3.0	Mrs	NaN
159	160	0	3	Sage, Master. Thomas Henry	1	NaN	8	2	CA. 2343	69.5500	NaN	3.0	Master	NaN
186	187	1	3	O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)	0	NaN	1	0	370365	15.5000	NaN	2.0	Mrs	NaN
252	253	0	1	Stead, Mr. William Thomas	1	62.00	0	0	113514	26.5500	C87	3.0	Mr	NaN

pandas.Series.str.contains — pandas 1.0.5 documentation
python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow

特定の文字列を含まない値を取り出したい場合は~演算子を使う。

df[~df['Name'].str.contains('Thomas')]

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title	AgeMean
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	NaN	3.0	Mr	NaN
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	C85	1.0	Mrs	NaN
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0	Miss	NaN
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	C123	3.0	Mrs	NaN
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	NaN	3.0	Mr	NaN

python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow

データフレーム表示時に色をつけたい

# 値が "Mr" のカラムの背景色を黄色にする
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title	AgeMean
0	1	0	3	Braund, Mr. Owen Harris	1	22.000000	1	A/5 21171	7.250000	nan	3.000000	Mr	nan
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	0	38.000000	1	PC 17599	71.283300	C85	1.000000	Mrs	nan
2	3	1	3	Heikkinen, Miss. Laina	0	26.000000	0	STON/O2. 3101282	7.925000	nan	3.000000	Miss	nan
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.000000	1	113803	53.100000	C123	3.000000	Mrs	nan
4	5	0	3	Allen, Mr. William Henry	1	35.000000	0	373450	8.050000	nan	3.000000	Mr	nan

Jupyter Notebook で開くと、該当カラムが背景色付きで表示される。
GitHub 上で Jupyter Notebook を開くと、背景色が付かないので注意。

pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow

CSV 形式で出力したい

df.to_csv('output.csv', index=False)

インデックス(行番号)を含みたくない場合は index=False を指定する。
pandas.DataFrame.to_csv — pandas 1.0.5 documentation

ファイル最終行に改行を入れたくない場合、最終行のみ line_terminator="" を渡す
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow

103

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up