
An if statement does not work as expected inside df.assign(column=lambda)

Posted at 2024-05-06

Overview

What happens when you run the following code?

Code

import pandas as pd
df = pd.DataFrame(data={ "A": [1, 0, 1], "B": [101, 102, 103]})
df = df.assign(C=lambda x: x.B if x.A == 1 else 0)
df

Result

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/vr/jqnz40rd3hnfhg8qrp388zyr0000gn/T/ipykernel_29170/1740181337.py in ?()
      1 import pandas as pd
      2 df = pd.DataFrame(data={ "A": [1, 0, 1], "B": [101, 102, 103]})
----> 3 df = df.assign(C=lambda x: x.B if x.A == 1 else 0)
      4 df

/opt/homebrew/lib/python3.11/site-packages/pandas/core/frame.py in ?(self, **kwargs)
   5235         """
   5236         data = self.copy(deep=None)
   5237 
   5238         for k, v in kwargs.items():
-> 5239             data[k] = com.apply_if_callable(v, data)
   5240         return data

/opt/homebrew/lib/python3.11/site-packages/pandas/core/common.py in ?(maybe_callable, obj, **kwargs)
    380     obj : NDFrame
    381     **kwargs
    382     """
    383     if callable(maybe_callable):
--> 384         return maybe_callable(obj, **kwargs)
    385 
    386     return maybe_callable

/var/folders/vr/jqnz40rd3hnfhg8qrp388zyr0000gn/T/ipykernel_29170/1740181337.py in ?(x)
----> 3 df = df.assign(C=lambda x: x.B if x.A == 1 else 0)

/opt/homebrew/lib/python3.11/site-packages/pandas/core/generic.py in ?(self)
   1575     @final
   1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
   1578             f"The truth value of a {type(self).__name__} is ambiguous. "
   1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1580         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

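Cause

The problem seems to be that the lambda uses logic that is not vectorized. assign calls the lambda once with the entire DataFrame, so x.A == 1 is a boolean Series rather than a single True or False, and the if cannot evaluate it.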

Your mistake is that you considered the lambda to act on rows, while it acts on full columns in a vectorized way. You need to use vectorized functions:
https://stackoverflow.com/questions/75404602/using-an-if-statement-inside-a-pandas-dataframes-assign-method
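To make the quoted explanation concrete, here is a minimal sketch that reuses the df from the example above. It only inspects what the condition looks like when it is evaluated on a whole column; the variable names cond and error_message are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={"A": [1, 0, 1], "B": [101, 102, 103]})

# assign() calls the lambda once with the whole DataFrame, not once per row,
# so the comparison yields a boolean Series with one entry per row.
cond = df.A == 1
print(type(cond).__name__)
print(cond.tolist())

# `if cond:` needs a single True/False, so Python calls bool(cond).
# pandas raises because a multi-element Series has no single truth value.
try:
    bool(cond)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The message printed at the end is the same ValueError that appears at the bottom of the traceback.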

Solution

Use NumPy's where function.

numpy.where is a vectorized function.

np.where(a > b, a - b, a + b) is "vectorized" because all arguments to the where work with arrays, and where itself uses them with full broadcasting powers.
https://stackoverflow.com/questions/71904004/numpy-vectorization-without-the-use-of-numpy-vectorize

Code

import pandas as pd
import numpy as np
df = pd.DataFrame(data={ "A": [1, 0, 1], "B": [101, 102, 103]})
df = df.assign(C=lambda x: np.where(x.A == 1, x.B, 0))
df

Result

	A	B	C
0	1	101	101
1	0	102	0
2	1	103	103
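As a side note, pandas itself offers Series.where, which may read more naturally inside assign. The sketch below is an alternative I am adding for comparison, not something from the quoted answers; it assumes the same df and should give the same column C as the np.where version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={"A": [1, 0, 1], "B": [101, 102, 103]})

# np.where(cond, a, b): take a where cond is True, otherwise b.
via_numpy = df.assign(C=lambda x: np.where(x.A == 1, x.B, 0))

# Series.where(cond, other): keep the original value where cond is True,
# otherwise replace it with `other`.
via_pandas = df.assign(C=lambda x: x.B.where(x.A == 1, 0))

print(via_numpy.C.tolist())
print(via_pandas.C.tolist())
```

Note the difference in argument order: np.where takes the condition first, while Series.where is called on the values to keep.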

Another example

This code is supposed to count the uppercase letters in each string and add the count as a new column, but it does not work as intended.

Code

def count_upper(lst):
    return sum(1 for l in lst if l.isupper())
        
df = pd.DataFrame(data={ "A": ["AAAAAaaa", "Aaaa", "AAAaa"]})
df = df.assign(C=lambda x: count_upper(x.A.str.strip()))
df

Result

	A	C
0	AAAAAaaa	0
1	Aaaa	0
2	AAAaa	0

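This is probably the same cause. count_upper receives the whole column, so the loop iterates over entire strings rather than over characters. None of the mixed-case strings passes isupper(), so the function returns a single 0, which assign then fills into every row.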
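The following sketch walks through that call step by step. It assumes the same count_upper and the same three strings; the Series s and the names items, flags and total are introduced here only for illustration.

```python
import pandas as pd

def count_upper(lst):
    return sum(1 for l in lst if l.isupper())

s = pd.Series(["AAAAAaaa", "Aaaa", "AAAaa"])

# Iterating over a Series yields its values: whole strings, not characters.
items = list(s)

# str.isupper() is True only if every cased character is uppercase,
# so each mixed-case string gives False.
flags = [item.isupper() for item in items]

# Nothing is counted, so the function returns one scalar for the whole column.
total = count_upper(s)

print(items)
print(flags)
print(total)
```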

Improved code

def count_upper(lst):
    return sum(1 for l in lst if l.isupper())
        
df = pd.DataFrame(data={ "A": ["AAAAAaaa", "Aaaa", "AAAaa"]})
df = df.assign(C=lambda x: x.A.str.strip().apply(count_upper))
df

Result

	A	C
0	AAAAAaaa	5
1	Aaaa	1
2	AAAaa	3
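If only ASCII letters matter, the same column can also be built with the pandas string accessor instead of apply. This is a sketch under that assumption, using Series.str.count with a regular expression; unlike str.isupper(), the pattern [A-Z] does not recognize non-ASCII uppercase letters, and this form is not necessarily faster.

```python
import pandas as pd

df = pd.DataFrame(data={"A": ["AAAAAaaa", "Aaaa", "AAAaa"]})

# Series.str.count(pattern) counts the regex matches in each string.
# Assumption: only ASCII uppercase letters A-Z need to be counted.
df = df.assign(C=lambda x: x.A.str.strip().str.count(r"[A-Z]"))

print(df.C.tolist())
```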

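Summary

Vectorization is a key element of pandas.
Vectorization means applying an operation to whole arrays at once, with broadcasting where the shapes differ, instead of looping element by element in Python. It is a technique for speeding up array computations, which is why assign passes the entire DataFrame to the lambda.
The solutions shown here are ways to avoid this trap, and they do not necessarily benefit from that speedup. As the article linked below shows, simply using apply does not make the code fast.

I realized that I need to deepen my understanding of broadcasting and vectorization.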
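The point about apply can be checked with a rough timing sketch like the one below. The frame size n and the helper names with_apply and with_where are hypothetical, chosen only for this comparison; the timings depend on the machine and the pandas version, so no numbers are given here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame, built only for this comparison.
n = 10_000
df = pd.DataFrame({"A": np.arange(n) % 2, "B": np.arange(n)})

def with_apply(frame):
    # Row-wise: the lambda runs once per row in Python.
    return frame.apply(lambda row: row.B if row.A == 1 else 0, axis=1)

def with_where(frame):
    # Column-wise: a single call over the whole arrays.
    return np.where(frame.A == 1, frame.B, 0)

# Both approaches should produce the same values.
same = bool((with_apply(df).to_numpy() == with_where(df)).all())
print("same result:", same)

# Wall-clock time for one run of each; results vary between environments.
print("apply:", timeit.timeit(lambda: with_apply(df), number=1))
print("where:", timeit.timeit(lambda: with_where(df), number=1))
```

Here apply with axis=1 is a correct row-wise version of the original if/else, but it still runs Python code per row, so it avoids the error without being vectorized.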

Sites that may be useful as references.
