justInCase Advent Calendar 2018

What Is NaN, Anyway? (Python) (394 days late)

Last updated at 2020-01-25, posted at 2020-01-11

update1 2020-01-25: fixed typo IEEE745 -> IEEE 754

In [1]: from datetime import datetime  
In [2]: (datetime(2020, 1, 11) - datetime(2018, 12, 13)).days                           
Out[2]: 394

This article explains how nan is handled in Python. Below, NaN as a concept is written as NaN.

Disclaimer:
This post was written for the justInCase Advent Calendar 2018 and was posted roughly 400 days late, but that maturation period has not made its content any richer.

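Summary

NaN in Python follows IEEE 754 NaN, but there are pitfalls here and there.
Beware of objects that are not float nan: Decimal('nan'), pd.NaT, numpy.datetime64('NaT').
The nan objects exposed by the numpy and pandas modules and math.nan are all ordinary float NaNs (the same value, though not necessarily the same object), so any of them can be used (but it is better to pick one for readability).
Note that pandas' isna(...) returns True not only for nan but also for None, pd.NaT and the like, treating them as missing values.
pandas is said to introduce pd.NA as its missing value starting with pandas 1.0.0. From then on, pd.NA rather than nan is considered the preferred way to denote missing values. I may write a separate article about this.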
- https://mobile.twitter.com/jorisvdbossche/status/1208476049690046465
- https://dev.pandas.io/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

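The test environment is listed at the end.

Handling of NaN in IEEE 754

Please see the previous article, "NaN とは NaN ぞや (R)" (the R edition of this article).

Quiet NaNs propagate through ordinary arithmetic, but what do you think the following two expressions return (or should return)? In fact, the treatment of NaN in min and max changed between IEEE 754-2008 and IEEE 754-2019. I will leave that explanation for a separate article.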

min(1.0, float('nan'))
max(1.0, float('nan'))
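As a minimal sketch of what CPython's built-in min and max actually do with these inputs (which is not necessarily what either revision of IEEE 754 prescribes): the built-ins keep the first argument unless a later one compares strictly smaller or larger, and every ordering comparison against NaN is False, so the result depends on argument order. The variable names below are only illustrative.

```python
nan = float('nan')

# The first argument is kept unless a later one compares strictly
# smaller (min) or larger (max); comparisons with NaN are always False.
min_nan_last = min(1.0, nan)
max_nan_last = max(1.0, nan)
min_nan_first = min(nan, 1.0)
max_nan_first = max(nan, 1.0)

print(min_nan_last, max_nan_last)    # 1.0 1.0
print(min_nan_first, max_nan_first)  # nan nan
```

In other words, the built-ins neither propagate NaN consistently nor ignore it consistently.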

How to obtain NaN in Python

There is no language literal. float('nan'), which needs no module import, is commonly used, as is np.nan in code that already imports numpy.

import math
import decimal
import numpy as np
import pandas as pd

float('nan')
math.nan
0.0 * math.inf
math.inf / math.inf
# 0.0/0.0 raises ZeroDivisionError in Python. Many languages such as C, R, and Julia return NaN.
np.nan
np.core.numeric.NaN
pd.np.nan

All of these are float objects. NaN is not a singleton, but the names exposed through numpy and pandas all refer to one and the same object.

nans = [float('nan'), math.nan, 0 * math.inf, math.inf / math.inf, np.nan, np.core.numeric.NaN, pd.np.nan]

import pprint
pprint.pprint([(type(n), id(n)) for n in nans])
# [(<class 'float'>, 4544450768),
#  (<class 'float'>, 4321186672),
#  (<class 'float'>, 4544450704),
#  (<class 'float'>, 4544450832),
#  (<class 'float'>, 4320345936),
#  (<class 'float'>, 4320345936),
#  (<class 'float'>, 4320345936)]

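float('nan') is an immutable object of the float class and is therefore hashable. It can thus be used as a dictionary key, but, oddly, multiple nan keys can be added. And unless a key was bound to a variable beforehand, its value can never be retrieved again. This appears to stem from the fact that NaN compares unequal to everything, itself included: float('nan') == float('nan') is False (of the comparison operators, only != returns True). A lookup with the very same object still succeeds because dict lookup checks object identity before equality.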

>>> d = {float('nan'): 1, float('nan'): 2, float('nan'): 3}
>>> d
{nan: 1, nan: 2, nan: 3}
>>> d[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
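A small sketch of the "bind the key to a variable first" workaround described above. The names nan_key and d are only illustrative.

```python
nan_key = float('nan')
d = {nan_key: 'found'}

# Same object: the identity check in dict lookup succeeds
# even though nan_key == nan_key is False.
same_object_hit = nan_key in d
value = d[nan_key]

# A freshly created NaN is a different object and never compares equal.
fresh_object_hit = float('nan') in d

print(value)             # found
print(same_object_hit)   # True
print(fresh_object_hit)  # False
```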

Note that there are NaN-like objects that are not of the float class. In particular, pd.NaT and np.datetime64("NaT") belong to different classes.

decimal.Decimal('nan')
pd.NaT
np.datetime64("NaT")

# >>> type(decimal.Decimal('nan'))
# <class 'decimal.Decimal'>

# >>> type(pd.NaT)
# <class 'pandas._libs.tslibs.nattype.NaTType'>

# >>> type(np.datetime64("NaT"))
# <class 'numpy.datetime64'>

Because of this, np.isnat has to be used with the following caveat in mind.

>>> np.isnat(pd.NaT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.

>>> np.isnat(np.datetime64("NaT"))
True

Checking for NaN

math.isnan
np.isnan
pd.isna

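The implementation behind math.isnan is around here (the first link is the Py_IS_NAN macro; the second is the pure-Python Decimal._isnan).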
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pymath.h#L88-L103
https://github.com/python/cpython/blob/34fd4c20198dea6ab2fe8dc6d32d744d9bde868d/Lib/_pydecimal.py#L713-L726

/* Py_IS_NAN(X)
 * Return 1 if float or double arg is a NaN, else 0.
 * Caution:
 *     X is evaluated more than once.
 *     This may not work on all platforms.  Each platform has *some*
 *     way to spell this, though -- override in pyconfig.h if you have
 *     a platform where it doesn't work.
 * Note: PC/pyconfig.h defines Py_IS_NAN as _isnan
 */
# ifndef Py_IS_NAN
# if defined HAVE_DECL_ISNAN && HAVE_DECL_ISNAN == 1
# define Py_IS_NAN(X) isnan(X)
# else
# define Py_IS_NAN(X) ((X) != (X))
# endif
# endif

def _isnan(self):
    """Returns whether the number is not actually one.
    0 if a number
    1 if NaN
    2 if sNaN
    """
    if self._is_special:
        exp = self._exp
        if exp == 'n':
            return 1
        elif exp == 'N':
            return 2
    return 0
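To make the two snippets above concrete, here is a minimal sketch using only the standard library. It shows the x != x fallback from the macro, and the quiet/signaling distinction that Decimal exposes through its public is_nan, is_qnan and is_snan methods. The variable names are only illustrative.

```python
import math
from decimal import Decimal

x = float('nan')
self_unequal = x != x          # the fallback definition of Py_IS_NAN
print(self_unequal, math.isnan(x))  # True True

q = Decimal('nan')    # quiet NaN
s = Decimal('sNaN')   # signaling NaN
print(q.is_nan(), q.is_qnan(), q.is_snan())  # True True False
print(s.is_nan(), s.is_qnan(), s.is_snan())  # True False True

# math.isnan converts its argument to float first; a quiet Decimal NaN
# converts fine, a signaling one raises ValueError.
quiet_ok = math.isnan(q)
try:
    math.isnan(s)
    snan_raised = False
except ValueError:
    snan_raised = True
print(quiet_ok, snan_raised)  # True True
```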

Note that pandas' isna (and isnull) returns True not only for float nan but also for None and pd.NaT, treating them as missing values.
As a tip, setting pandas.options.mode.use_inf_as_na = True makes np.inf count as a missing value as well.

>>> pd.isna(math.nan)
True
>>> pd.isna(None)
True

>>> pd.isna(math.inf)
False
>>> pd.options.mode.use_inf_as_na = True
>>> pd.isna(math.inf)
True

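About the pandas methods

The module-level function pd.isna takes a scalar or an array-like and returns a bool (or an array of bools) of the same size as its argument. The DataFrame method pd.DataFrame.isna, on the other hand, takes a DataFrame and returns a DataFrame.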

pd.isna # for scalar or array-like
pd.DataFrame.isna # for DataFrame
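A short sketch of the difference in input and output shapes, assuming numpy and pandas are installed. The sample values and the column names a and b are made up for illustration.

```python
import numpy as np
import pandas as pd

# Scalar in, scalar bool out
scalar_result = pd.isna(np.nan)

# Array-like in, boolean array of the same shape out
array_result = pd.isna(np.array([1.0, np.nan, 3.0]))

# DataFrame in, boolean DataFrame of the same shape out
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [None, 'x']})
frame_result = df.isna()

print(scalar_result)
print(array_result)
print(frame_result)
```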

Here, "array-like object" specifically means the following objects. (https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/dtypes/missing.py#L136-L147)

ABCSeries,
np.ndarray,
ABCIndexClass,
ABCExtensionArray,
ABCDatetimeArray,
ABCTimedeltaArray,

Note that pd.isna and pd.isnull are exactly the same function, so either one is fine (it is preferable to use one consistently for readability).

# https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/dtypes/missing.py#L125
>>> id(pd.isnull)
4770964688
>>> id(pd.isna)
4770964688

Summary of the is* methods

If you want to avoid running into unexpected errors, pd.isna is the safe choice, but note that it misses Decimal('nan').

	math.nan	decimal.Decimal('nan')	np.datetime64("NaT")	pd.NaT	math.inf	None
math.isnan	True	True	error	error	False	error
decimal.Decimal.is_nan	error	True	error	error	error	error
np.isnan	True	error	True	error	False	error
pd.isna	True	False	True	True	False	True
np.isnat	error	error	True	error	error	error
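One way to close the Decimal('nan') gap noted above is to check for Decimal before delegating to pd.isna. The following is only a sketch: is_missing_scalar is a hypothetical helper name, not a pandas API, and it is meant for scalars only (passing an array-like is outside its scope).

```python
import decimal
import math

import pandas as pd


def is_missing_scalar(x):
    """Return True for NaN-like or missing scalars (hypothetical helper)."""
    # Decimal is handled first so the result does not depend on
    # how pd.isna treats Decimal('nan').
    if isinstance(x, decimal.Decimal):
        return x.is_nan()  # True for both quiet and signaling NaN
    return bool(pd.isna(x))


for value in (math.nan, decimal.Decimal('nan'), pd.NaT, None, math.inf, 1.0):
    print(repr(value), is_missing_scalar(value))
```

With the default pandas options, math.inf and ordinary numbers come out as not missing, matching the pd.isna row of the table.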

Miscellaneous

Checking the binary representation. We can see that it is a quiet NaN.

>>> import struct
>>> xs = struct.pack('>d', math.nan)
>>> xs
b'\x7f\xf8\x00\x00\x00\x00\x00\x00'
>>> xn = struct.unpack('>Q', xs)[0]
>>> xn
9221120237041090560
>>> bin(xn)
'0b111111111111000000000000000000000000000000000000000000000000000'
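The bit string above can be split into the binary64 fields to read it more easily. This is a minimal sketch: double_fields is a hypothetical helper name, and treating the top fraction bit as the "quiet" flag assumes the IEEE 754-2008 convention used by common platforms such as x86-64.

```python
import math
import struct


def double_fields(x):
    """Split a float into its binary64 sign, exponent and fraction fields."""
    (bits,) = struct.unpack('>Q', struct.pack('>d', x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits
    fraction = bits & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, fraction


sign, exponent, fraction = double_fields(math.nan)

# NaN: exponent all ones and a non-zero fraction.
is_nan = exponent == 0x7FF and fraction != 0
# Assumption: the most significant fraction bit set means quiet NaN.
is_quiet = bool(fraction >> 51)

print(f'{exponent:011b} {fraction:052b}')
print(is_nan, is_quiet)
```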

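Summary (repeated)

NaN in Python follows IEEE 754 NaN, but there are pitfalls here and there.
Beware of objects that are not float nan: Decimal('nan'), pd.NaT, numpy.datetime64('NaT').
The nan objects exposed by the numpy and pandas modules and math.nan are all ordinary float NaNs (the same value, though not necessarily the same object), so any of them can be used (but it is better to pick one for readability).
Note that pandas' isna(...) returns True not only for nan but also for None, NaT and the like, treating them as missing values.
pandas introduces pd.NA as its missing value starting with pandas 1.0.0. From then on, pd.NA rather than nan is considered the preferred way to denote missing values. I may write a separate article about this.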
- https://mobile.twitter.com/jorisvdbossche/status/1208476049690046465
- https://dev.pandas.io/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

Finally
If you love this kind of geeky topic, please come and visit justInCase. https://www.wantedly.com/companies/justincase

That's all.

Test environment

$ uname -a
Darwin MacBook-Pro-3.local 18.7.0 Darwin Kernel Version 18.7.0: Sat Oct 12 00:02:19 PDT 2019; root:xnu-4903.278.12~1/RELEASE_X86_64 x86_64

$ python
Python 3.7.4 (default, Nov 17 2019, 08:06:12) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin

$ pip list | grep -e numpy -e pandas
numpy                    1.18.0     
pandas                   0.25.3
