76
103

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

Pandas で○○したい

Last updated at Posted at 2020-07-06

更新履歴

  • 2020/07/08
    データの表示形式を CSV から Markdown Table に変更しました。
    CSV から Markdown Table への変換には CSV to Markdown Table Generator を使用しました。

  • 2021/01/18
    欠損値を補完する方法を追記しました。


Pandas で「あれ、〇〇したいときどうすれば良いんだっけ?」となることが多いので、用途別にまとめます。

前提

今回のサンプルコードでは、
Kaggle社が提供しているタイタニック号の生存者リスト(train.csv)を
pandas.read_csv() で読み込んで使用します。

Titanic: Machine Learning from Disaster | Kaggle

import pandas as pd
df = pd.read_csv('train.csv')

pandas.read_csv — pandas 1.0.5 documentation

統計情報を出力したい

df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
# 出力するカラムを絞る
df['Age'].describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

pandas.DataFrame.describe — pandas 1.0.5 documentation

データの件数を調べたい

df['Age'].count()
714

NoneNaNNaT 以外の値が含まれている行数/列数を調べることができる。

データを絞り込みたい

# 20 < Age < 40 の行を取り出す
df[(20 < df['Age']) & (df['Age'] < 40)].head()
Index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

複数のAND/OR条件で絞り込みたい場合は、df[(A) & (B)]のように、条件を () で囲んで指定する。

カテゴリー化データを数値に変換したい

# Embarked(C, Q, S)を数値(1, 2, 3)に変換
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
Index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 3.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 1.0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 3.0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 3.0

pandas.Series.map — pandas 1.0.4 documentation

列名を変更したい

# Sex(female, male)を数値(0, 1)に変換し、列名(Sex)をMaleに変更
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN 3.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1.0
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 3.0
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN 3.0

pandas.DataFrame.rename — pandas 1.0.4 documentation

列名の一覧を持つ配列を渡せば、全ての列名を一括で変更することもできる。

pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']
Index a b
0 1 10
1 2 20

python - Renaming columns in pandas - Stack Overflow

欠損値の数を列ごとに確認したい

df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation

欠損値を除外したい

# 欠損値を含む行全てを除外
df_dn = df.dropna()
df_dn.count()
PassengerId    183
Survived       183
Pclass         183
Name           183
Male           183
Age            183
SibSp          183
Parch          183
Ticket         183
Fare           183
Cabin          183
Embarked       183
dtype: int64

pandas.DataFrame.dropna — pandas 1.0.5 documentation

欠損値を補完したい

# 列 Age の欠損値を Age の中央値で補完
df['Age'] = df['Age'].fillna(df['Age'].median())
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

pandas.DataFrame.fillna — pandas 1.2.0 documentation

指定した列を取り出したい

# Survived と Age 列を取り出す
df[['Survived', 'Age']]
Index Survived Age
0 0 22.0
1 1 38.0
2 1 26.0
3 1 35.0
4 0 35.0

Indexing and selecting data — pandas 1.0.4 documentation
pandasで任意の位置の値を取得・変更するat, iat, loc, iloc | note.nkmk.me

指定した列を除外したい

df_dn = df.drop('Cabin', axis='columns')
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 3.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 1.0
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 3.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 3.0
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 3.0

pandas.DataFrame.dropna — pandas 1.0.5 documentation

行/列の値に関数を適用し、新しい行/列を作りたい

import re

# 敬称を抽出する関数
def getTitle(row):
    name = row['Name']
    p = re.compile('.*\ (.*)\.\ .*')
    surname = p.search(name)
    return surname.group(1)

df['Title'] = df.apply(getTitle, axis='columns')
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN 3.0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1.0 Mrs
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 3.0 Mrs
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN 3.0 Mr

pandas.DataFrame.apply — pandas 1.0.5 documentation

カテゴリごとの平均を計算したい

# 敬称ごとの平均年齢を求める
df.groupby('Title').mean()['Age']
Title
Capt        70.000000
Col         58.000000
Countess    33.000000
Don         40.000000
Dr          42.000000
Jonkheer    38.000000
L           54.000000
Lady        48.000000
Major       48.500000
Master       4.574167
Miss        21.773973
Mlle        24.000000
Mme         24.000000
Mr          32.368090
Mrs         35.728972
Ms          28.000000
Rev         43.166667
Sir         49.000000
Name: Age, dtype: float64

df.groupby('Title').count() とすると、敬称ごとのデータ件数を求めることも出来る。
Pandas の groupby の使い方 - Qiita

値で行をソートしたい

df.sort_values(by='Age')
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
803 804 1 3 Thomas, Master. Assad Alexander 1 0.42 0 1 2625 8.5167 NaN 1.0 Master NaN
755 756 1 2 Hamalainen, Master. Viljo 1 0.67 1 1 250649 14.5000 NaN 3.0 Master NaN
644 645 1 3 Baclini, Miss. Eugenie 0 0.75 2 1 2666 19.2583 NaN 1.0 Miss NaN
469 470 1 3 Baclini, Miss. Helene Barbara 0 0.75 2 1 2666 19.2583 NaN 1.0 Miss NaN
78 79 1 2 Caldwell, Master. Alden Gates 1 0.83 0 2 248738 29.0000 NaN 3.0 Master NaN

pandas.DataFrame.sort_values — pandas 1.0.5 documentation

通常 sort_values()を実行した DaraFrame は変更されず、返り値がソートされた状態で得られる。
ascending=Falseを指定すると、指定した列の降順でソートされる。
inplace=Trueを指定すると、sort_values()を実行した DataFrame がソートされ、返り値は None になる。

列に含まれているユニークな値を調べたい

df['Survived'].unique()
array([0, 1], dtype=int64)

pandas.unique — pandas 1.0.5 documentation

特定の文字列を含む行を取り出したい

df[df['Name'].str.contains('Thomas')]
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
149 150 0 2 Byles, Rev. Thomas Roussel Davids 1 42.00 0 0 244310 13.0000 NaN 3.0 Rev NaN
151 152 1 1 Pears, Mrs. Thomas (Edith Wearne) 0 22.00 1 0 113776 66.6000 C2 3.0 Mrs NaN
159 160 0 3 Sage, Master. Thomas Henry 1 NaN 8 2 CA. 2343 69.5500 NaN 3.0 Master NaN
186 187 1 3 O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) 0 NaN 1 0 370365 15.5000 NaN 2.0 Mrs NaN
252 253 0 1 Stead, Mr. William Thomas 1 62.00 0 0 113514 26.5500 C87 3.0 Mr NaN

pandas.Series.str.contains — pandas 1.0.5 documentation
python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow

特定の文字列を含まない値を取り出したい場合は~演算子を使う。

df[~df['Name'].str.contains('Thomas')]
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN 3.0 Mr NaN
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1.0 Mrs NaN
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0 Miss NaN
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 3.0 Mrs NaN
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN 3.0 Mr NaN

python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow

データフレーム表示時に色をつけたい

# 値が "Mr" のカラムの背景色を黄色にする
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
0 1 0 3 Braund, Mr. Owen Harris 1 22.000000 1 0 A/5 21171 7.250000 nan 3.000000 Mr nan
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 0 38.000000 1 0 PC 17599 71.283300 C85 1.000000 Mrs nan
2 3 1 3 Heikkinen, Miss. Laina 0 26.000000 0 0 STON/O2. 3101282 7.925000 nan 3.000000 Miss nan
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.000000 1 0 113803 53.100000 C123 3.000000 Mrs nan
4 5 0 3 Allen, Mr. William Henry 1 35.000000 0 0 373450 8.050000 nan 3.000000 Mr nan

Jupyter Notebook で開くと、該当カラムが背景色付きで表示される。
GitHub 上で Jupyter Notebook を開くと、背景色が付かないので注意。

pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow

CSV 形式で出力したい

df.to_csv('output.csv', index=False)

インデックス(行番号)を含みたくない場合は index=False を指定する。
pandas.DataFrame.to_csv — pandas 1.0.5 documentation

ファイル最終行に改行を入れたくない場合、最終行のみ line_terminator="" を渡す
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow

76
103
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
76
103

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?