EMC Healthcare株式会社

A summary of replacement operations on a pandas DataFrame

Last updated at 2018-10-30, posted at 2018-02-15

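1. Introduction

1.1 Purpose and conclusion

I was never quite sure how to replace values in a pandas DataFrame: doing it by assignment sometimes triggers a warning, and I could not remember which method was the right one to use. This memo sorts that out.

In conclusion:

When assigning, take care not to assign to a slice of the whole DataFrame.
Among the available methods, replace is the one to use.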

1.2 Environment

Python 3.6.0

2. Setup

Prepare a simple two-column DataFrame.

import pandas as pd
import numpy as np

d = np.arange(10).reshape(5,2)
df = pd.DataFrame(d, columns=["col1" ,"col2"])

# In [77]: df
# Out[77]:
#    col1  col2
# 0     0     1
# 1     2     3
# 2     4     5
# 3     6     7
# 4     8     9

This memo assumes replacement within a single column.
A pandas DataFrame is a collection of Series, so each column is a Series.

print(type(df.col1))
# <class 'pandas.core.series.Series'>

print(df.col1.dtype)
# int64

3. Assignment

The first way to replace values is to select only the rows that match a condition and assign to them directly.

df.col1[df.col1 == 2] = 100

# In [42]: df
# Out[42]:
#    col1  col2
# 0     0     1
# 1   100     3
# 2     4     5
# 3     6     7
# 4     8     9

3.1. SettingWithCopyWarning

df[df.col1 == 2].col1 = 100

# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_indexer,col_indexer] = value instead
# See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

df

# In [90]: df
# Out[90]:
#    col1  col2
# 0     0     1
# 1     2     3
# 2     4     5
# 3     6     7
# 4     8     9

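Assigning to a particular column (col1) of the sliced frame df[df.col1 == 2] produces the warning above.
On top of that, the value is not replaced.

In other words, in the environment used here, assigning to a slice of a column works, but assigning to a column of a slice does not.
df.col1[df.col1 == 2] is OK.
df[df.col1 == 2].col1 is not.

(I remember once hitting a case where only the warning appeared and the assignment still went through, but I have forgotten what the case was. I tried a few variations, could not reproduce it, and gave up.)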
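The warning text itself recommends .loc, so here is a minimal sketch of that form, assuming the same sample data as above. The names mask and subset are only illustrative. The point is that df[mask] is a separate object from df, whereas a single .loc call selects rows and column at once and writes into df itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["col1", "col2"])

mask = df["col1"] == 2

# df[mask] builds a new DataFrame holding only the matching rows,
# so a value assigned to it does not reach df.
subset = df[mask]
print(subset is df)

# One .loc call selects the rows and the column together and writes into df.
df.loc[mask, "col1"] = 100
print(df)
```

Keeping the row condition and the column label inside the same indexer is what avoids the intermediate copy.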

3.2. Floats work too

Float values appear to be converted to int automatically. The column is still int after the assignment.

df.col1[df.col1 == 2.] = 100.

# In [46]: df
# Out[46]:
#    col1  col2
# 0     0     1
# 1   100     3
# 2     4     5
# 3     6     7
# 4     8     9

print(df.col1.dtype)
# int64

4. where

This is the approach that came up when I googled "pandas 置換" (pandas replacement).

df.col1 = df.col1.where(df.col1 == 2, 100)

# In [80]: df
# Out[80]:
#    col1  col2
# 0   100     1
# 1     2     3
# 2   100     5
# 3   100     7
# 4   100     9

where keeps the entries for which the condition holds and fills everything else with NaN.

If a second argument is given, it is used in place of NaN for those entries.

pandas.Series.where

other : scalar, NDFrame, or callable
Entries where cond is False are replaced with corresponding value from other.

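So it is likely to behave differently from the replacement you have in mind, which makes it an awkward choice for replacing values. It is of course fine as long as you keep in mind that it fills the entries where the condition is False.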
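A minimal sketch of the behaviour described above, assuming a Series with the same values as col1. The variable names are only illustrative. It shows the NaN fill, the fill with a second argument, and that where only acts as a "replace the matches" operation if the condition is negated.

```python
import pandas as pd

s = pd.Series([0, 2, 4, 6, 8], name="col1")

# Without a second argument, entries where the condition is False become NaN.
kept = s.where(s == 2)

# With a second argument, those entries are filled with that value instead.
filled = s.where(s == 2, 100)

# To replace the entries equal to 2, the condition has to be negated:
# keep everything that is NOT 2 and fill the rest with 100.
replaced = s.where(s != 2, 100)

print(kept)
print(filled)
print(replaced)
```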

5. replace

Among the methods, this is probably the most intuitive one. It is even named replace.

df.col1 = df.col1.replace(2, 100)

# In [84]: df
# Out[84]:
#    col1  col2
# 0     0     1
# 1   100     3
# 2     4     5
# 3     6     7
# 4     8     9
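Besides a single value, the pandas documentation lists list and dict among the accepted types for the first argument of replace. A short sketch, assuming a Series with the same values as col1 (variable names are illustrative):

```python
import pandas as pd

s = pd.Series([0, 2, 4, 6, 8], name="col1")

# A list: every listed value is replaced with the same new value.
same_value = s.replace([2, 4], 100)

# A dict: each key is replaced with its own value.
per_value = s.replace({2: 100, 4: 200})

print(same_value.tolist())
print(per_value.tolist())
```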

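5.1. Passing a condition by mistake

If you pass a condition as the first argument out of slicing habit, you get behavior like the following, which is not what you would expect.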

df
#    col1  col2
# 0     0     1
# 1     2     3
# 2     4     5
# 3     6     7
# 4     8     9


df.col1 = df.col1.replace(df.col1 >= 3, 100)

df
#    col1  col2
# 0   100     1
# 1     2     3
# 2     4     5
# 3     6     7
# 4     8     9

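This happens because the first argument of replace is documented as follows:
pandas.Series.replace

to_replace : str, regex, list, dict, Series, numeric, or None

so it also accepts list-like input.

Evaluating the condition on df before the replacement gives the following bool Series: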

df.col1 >= 3

# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# Name: col1, dtype: bool

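These bools appear to be interpreted as ints automatically, so [0,0,1,1,1] is compared against the [0,2,4,6,8] in df; the first element, 0, matches and is therefore replaced with 100.

The result is not what was intended (replacing values of 3 or greater), so some care is needed.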
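For the operation that was actually intended here (replace every value of 3 or greater), a condition needs a method that takes a boolean mask rather than replace. One option, not used in the original memo, is Series.mask, which is the counterpart of where and replaces the entries for which the condition is True. A minimal sketch with the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["col1", "col2"])

# mask replaces the entries where the condition is True
# and leaves all other entries untouched.
df["col1"] = df["col1"].mask(df["col1"] >= 3, 100)

print(df)
```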

6. loc

Added 2018/3/6.
This approach also looks good.
A nice point is that it works with slices based on inequalities.

df.loc[df.col1 > 4, "col2"] = 100

df
#    col1  col2
# 0     0     1
# 1     2     3
# 2     4     5
# 3     6   100
# 4     8   100

7. Filtering on conditions over multiple columns, then replacing

Added 2018/10/30.

data = {"sex": ["M", "M", "F", "F", "F"], "age": [30, 40, 40, 40, 10], "val": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

#   sex  age  val
# 0   M   30    1
# 1   M   40    2
# 2   F   40    3
# 3   F   40    4
# 4   F   10    5

query_str = "sex == 'F' and age == 40"
df_subset = df.query(query_str)
df.loc[df_subset.index, "val"] = 100

#   sex  age  val
# 0   M   30    1
# 1   M   40    2
# 2   F   40  100
# 3   F   40  100
# 4   F   10    5

There is probably a neater way to do this, but I am leaving it here as a note for now.
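One possible alternative, offered as a sketch rather than as the definitive way: combine the per-column conditions with & into a single boolean mask and pass it straight to .loc, which skips the intermediate query result. The name cond is only illustrative.

```python
import pandas as pd

data = {"sex": ["M", "M", "F", "F", "F"], "age": [30, 40, 40, 40, 10], "val": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Each comparison yields a bool Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
cond = (df["sex"] == "F") & (df["age"] == 40)

df.loc[cond, "val"] = 100

print(df)
```

The query string form can still be easier to read when the condition is built dynamically, so which one fits better depends on the situation.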
