更新履歴
-
2020/07/08
データの表示形式を CSV から Markdown Table に変更しました。
CSV から Markdown Table への変換には CSV to Markdown Table Generator を使用しました。 -
2021/01/18
欠損値を補完する方法を追記しました。
Pandas で「あれ、〇〇したいときどうすれば良いんだっけ?」となることが多いので、用途別にまとめます。
前提
今回のサンプルコードでは、
Kaggle社が提供しているタイタニック号の生存者リスト(train.csv
)を
pandas.read_csv()
で読み込んで使用します。
Titanic: Machine Learning from Disaster | Kaggle
import pandas as pd
df = pd.read_csv('train.csv')
pandas.read_csv — pandas 1.0.5 documentation
統計情報を出力したい
df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# 出力するカラムを絞る
df['Age'].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
pandas.DataFrame.describe — pandas 1.0.5 documentation
データの件数を調べたい
df['Age'].count()
714
None
、NaN
、NaT
以外の値が含まれている行数/列数を調べることができる。
データを絞り込みたい
# 20 < Age < 40 の行を取り出す
df[(20 < df['Age']) & (df['Age'] < 40)].head()
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
複数のAND/OR条件で絞り込みたい場合は、df[(A) & (B)]
のように、条件を ()
で囲んで指定する。
カテゴリー化データを数値に変換したい
# Embarked(C, Q, S)を数値(1, 2, 3)に変換
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.Series.map — pandas 1.0.4 documentation
列名を変更したい
# Sex(female, male)を数値(0, 1)に変換し、列名(Sex)をMaleに変更
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.DataFrame.rename — pandas 1.0.4 documentation
列名の一覧を持つ配列を渡せば、全ての列名を一括で変更することもできる。
pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']
Index | a | b |
---|---|---|
0 | 1 | 10 |
1 | 2 | 20 |
python - Renaming columns in pandas - Stack Overflow
欠損値の数を列ごとに確認したい
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Male 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation
欠損値を除外したい
# 欠損値を含む行全てを除外
df_dn = df.dropna()
df_dn.count()
PassengerId 183
Survived 183
Pclass 183
Name 183
Male 183
Age 183
SibSp 183
Parch 183
Ticket 183
Fare 183
Cabin 183
Embarked 183
dtype: int64
pandas.DataFrame.dropna — pandas 1.0.5 documentation
欠損値を補完したい
# 列 Age の欠損値を Age の中央値で補完
df['Age'] = df['Age'].fillna(df['Age'].median())
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
pandas.DataFrame.fillna — pandas 1.2.0 documentation
指定した列を取り出したい
# Survived と Age 列を取り出す
df[['Survived', 'Age']]
Index | Survived | Age |
---|---|---|
0 | 0 | 22.0 |
1 | 1 | 38.0 |
2 | 1 | 26.0 |
3 | 1 | 35.0 |
4 | 0 | 35.0 |
Indexing and selecting data — pandas 1.0.4 documentation
pandasで任意の位置の値を取得・変更するat, iat, loc, iloc | note.nkmk.me
指定した列を除外したい
df_dn = df.drop('Cabin', axis='columns')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3.0 |
pandas.DataFrame.dropna — pandas 1.0.5 documentation
行/列の値に関数を適用し、新しい行/列を作りたい
import re
# 敬称を抽出する関数
def getTitle(row):
name = row['Name']
p = re.compile('.*\ (.*)\.\ .*')
surname = p.search(name)
return surname.group(1)
df['Title'] = df.apply(getTitle, axis='columns')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr |
pandas.DataFrame.apply — pandas 1.0.5 documentation
カテゴリごとの平均を計算したい
# 敬称ごとの平均年齢を求める
df.groupby('Title').mean()['Age']
Title
Capt 70.000000
Col 58.000000
Countess 33.000000
Don 40.000000
Dr 42.000000
Jonkheer 38.000000
L 54.000000
Lady 48.000000
Major 48.500000
Master 4.574167
Miss 21.773973
Mlle 24.000000
Mme 24.000000
Mr 32.368090
Mrs 35.728972
Ms 28.000000
Rev 43.166667
Sir 49.000000
Name: Age, dtype: float64
df.groupby('Title').count()
とすると、敬称ごとのデータ件数を求めることも出来る。
Pandas の groupby の使い方 - Qiita
値で行をソートしたい
df.sort_values(by='Age')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | 1 | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | 1.0 | Master | NaN |
755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | 1 | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | 3.0 | Master | NaN |
644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | 1 | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | 3.0 | Master | NaN |
pandas.DataFrame.sort_values — pandas 1.0.5 documentation
通常 sort_values()
を実行した DaraFrame は変更されず、返り値がソートされた状態で得られる。
ascending=False
を指定すると、指定した列の降順でソートされる。
inplace=True
を指定すると、sort_values()
を実行した DataFrame がソートされ、返り値は None
になる。
列に含まれているユニークな値を調べたい
df['Survived'].unique()
array([0, 1], dtype=int64)
pandas.unique — pandas 1.0.5 documentation
特定の文字列を含む行を取り出したい
df[df['Name'].str.contains('Thomas')]
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
149 | 150 | 0 | 2 | Byles, Rev. Thomas Roussel Davids | 1 | 42.00 | 0 | 0 | 244310 | 13.0000 | NaN | 3.0 | Rev | NaN |
151 | 152 | 1 | 1 | Pears, Mrs. Thomas (Edith Wearne) | 0 | 22.00 | 1 | 0 | 113776 | 66.6000 | C2 | 3.0 | Mrs | NaN |
159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | 1 | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | 3.0 | Master | NaN |
186 | 187 | 1 | 3 | O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) | 0 | NaN | 1 | 0 | 370365 | 15.5000 | NaN | 2.0 | Mrs | NaN |
252 | 253 | 0 | 1 | Stead, Mr. William Thomas | 1 | 62.00 | 0 | 0 | 113514 | 26.5500 | C87 | 3.0 | Mr | NaN |
pandas.Series.str.contains — pandas 1.0.5 documentation
python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow
特定の文字列を含まない値を取り出したい場合は~
演算子を使う。
df[~df['Name'].str.contains('Thomas')]
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr | NaN |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs | NaN |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss | NaN |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs | NaN |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr | NaN |
python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow
データフレーム表示時に色をつけたい
# 値が "Mr" のカラムの背景色を黄色にする
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.250000 | nan | 3.000000 | Mr | nan |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.283300 | C85 | 1.000000 | Mrs | nan |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.925000 | nan | 3.000000 | Miss | nan |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.100000 | C123 | 3.000000 | Mrs | nan |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.050000 | nan | 3.000000 | Mr | nan |
Jupyter Notebook で開くと、該当カラムが背景色付きで表示される。
GitHub 上で Jupyter Notebook を開くと、背景色が付かないので注意。
pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow
CSV 形式で出力したい
df.to_csv('output.csv', index=False)
インデックス(行番号)を含みたくない場合は index=False
を指定する。
pandas.DataFrame.to_csv — pandas 1.0.5 documentation
ファイル最終行に改行を入れたくない場合、最終行のみ line_terminator=""
を渡す
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow