
Pandas notes: None, np.nan, and empty strings

Posted at 2020-02-15


I got tripped up by how pandas handles None and np.nan, so these are personal notes.

I tested in the environments below (the results were the same in both).

python2.7.5, pandas==0.24.2
python3.6.1, pandas==0.25.3

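Summary

	None	np.nan	Empty string
DataFrame creation	Converted to np.nan unless the column ends up as object (dtype=object, or the dtype=int fallback)	np.nan cannot be converted to int, so a column containing np.nan generally becomes float	Treated as a string (non-numeric), not as a missing value; a column containing an empty string generally becomes object
read_csv	-	Both empty fields and empty strings in the CSV are read as np.nan, whichever dtype is specified	-
fillna, dropna	Treated as missing	Treated as missing	Not treated as missing
groupby	Treated as missing and ignored	Treated as missing and ignored	Not treated as missing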

Verification results

Building a DataFrame with a dtype specified

Check how the column types change when the data below is built with different dtype arguments.


import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    }
)

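No dtype specified

None appears to be converted to np.nan, and as a result any column containing np.nan becomes float64.

Column A: None is converted to np.nan and the column becomes float64
Column B: no value conversion; the column becomes object
Column C: becomes float64
Column D: no value conversion; the column becomes int64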


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    }
)

print(df)

     A  B    C  D
0  1.0  1  1.0  1
1  2.0  2  2.0  2
2  3.0  3  3.0  3
3  NaN     NaN  4

print(df.dtypes)

A    float64
B     object
C    float64
D      int64
dtype: object

print(df.values)

array([[1.0, '1', 1.0, 1],
       [2.0, '2', 2.0, 2],
       [3.0, '3', 3.0, 3],
       [nan, '', nan, 4]], dtype=object)
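As a quick check of the inference described above, the following sketch rebuilds the same four columns and inspects the numeric ones. It assumes a reasonably recent pandas; column B may be reported as object or as a dedicated string dtype depending on the version, so the sketch does not check it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4],
    }
)

# None is replaced by NaN during construction, so A ends up as a float column
a_is_float = pd.api.types.is_float_dtype(df["A"])
c_is_float = pd.api.types.is_float_dtype(df["C"])
# D has no missing value, so it can stay an integer column
d_is_int = pd.api.types.is_integer_dtype(df["D"])
last_a = df.loc[3, "A"]

print(a_is_float, c_is_float, d_is_int, last_a)
```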

Specifying object

No values are changed, and None is kept as is.


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=object
)

print(df)

      A  B    C  D
0     1  1    1  1
1     2  2    2  2
2     3  3    3  3
3  None     NaN  4

print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)
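Even though None and np.nan stay distinct objects in an object column, pandas' missing-value check treats them the same way. A minimal sketch, using a hypothetical Series `s` that is not part of the original checks:

```python
import numpy as np
import pandas as pd

# dtype=object keeps None, np.nan and "" exactly as given
s = pd.Series([1, None, np.nan, ""], dtype=object)

# isna() flags both None and np.nan, but not the empty string
missing_flags = s.isna().tolist()
# An identity check still tells None apart from np.nan
none_flags = [value is None for value in s]

print(missing_flags)
print(none_flags)
```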

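Specifying float

The empty string cannot be converted to float, so only the column containing it becomes object.

Column A: None is converted to np.nan and the column becomes float64
Column B: the empty string cannot be converted to float, so the column becomes object
Column C: becomes float64
Column D: becomes float64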


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=float
)

print(df)

     A  B    C    D
0  1.0  1  1.0  1.0
1  2.0  2  2.0  2.0
2  3.0  3  3.0  3.0
3  NaN     NaN  4.0

print(df.dtypes)

A    float64
B     object
C    float64
D    float64
dtype: object

print(df.values)

array([[1.0, '1', 1.0, 1.0],
       [2.0, '2', 2.0, 2.0],
       [3.0, '3', 3.0, 3.0],
       [nan, '', nan, 4.0]], dtype=object)
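If the goal is to get a float column out of strings like those in column B, one option is an explicit conversion with `pd.to_numeric`. This is a sketch of that alternative, not something the checks above covered; `errors="coerce"` turns anything unparseable into NaN.

```python
import pandas as pd

# Same values as column B, held as an object column
s = pd.Series(["1", "2", "3", ""], dtype=object)

# Unparseable entries (here the empty string) become NaN
as_float = pd.to_numeric(s, errors="coerce")

print(as_float)
```

Note that coercion also hides genuinely malformed values, so it is worth counting the NaN entries before and after.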

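Specifying int

Columns that cannot be converted to int64 (for example, those containing np.nan or None) become object.

Columns A to C: cannot be converted to int64, so they become object
Column D: becomes int64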


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=int
)

print(df)

      A  B    C  D
0     1  1    1  1
1     2  2    2  2
2     3  3    3  3
3  None     NaN  4

print(df.dtypes)

A    object
B    object
C    object
D     int64
dtype: object

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)
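pandas also has a nullable integer dtype, spelled "Int64" with a capital I, that can hold integers alongside a missing value. It was not part of the checks above, so treat the following as a sketch to verify in your own environment.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer extension dtype
s = pd.Series([1, 2, 3, None], dtype="Int64")

dtype_name = str(s.dtype)
missing_flags = s.isna().tolist()
# Missing values are skipped by default when summing
total = int(s.sum())

print(dtype_name, missing_flags, total)
```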

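read_csv with a dtype specified

Check what the column types become when the CSV below is read with different dtype arguments.

sample.csv

# Column A: int + empty field
# Column B: string + empty string
# Column C: float + empty field
# Column D: int only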
A,B,C,D
1,"1",1.0,1
2,"2",2.0,2
3,"3",3.0,3
,"",,4

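No dtype specified

Both empty fields and empty strings are read as np.nan, and as a result int columns are converted to float.

Column A: the empty field is converted to np.nan and the column becomes float64
Column B: the empty string is converted to np.nan and the column becomes float64
Column C: the empty field is converted to np.nan and the column becomes float64
Column D: no value conversion; the column becomes int64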


df = pd.read_csv("sample.csv")

print(df)

     A    B    C  D
0  1.0  1.0  1.0  1
1  2.0  2.0  2.0  2
2  3.0  3.0  3.0  3
3  NaN  NaN  NaN  4

print(df.dtypes)

A    float64
B    float64
C    float64
D      int64
dtype: object

print(df.values)

array([[ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [nan, nan, nan,  4.]])

Specifying object

Empty fields and empty strings are converted to np.nan, while all other values are read as str.


df = pd.read_csv("sample.csv", dtype=object)

print(df)

     A    B    C  D
0    1    1  1.0  1
1    2    2  2.0  2
2    3    3  3.0  3
3  NaN  NaN  NaN  4

print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object

print(df.values)

array([['1', '1', '1.0', '1'],
       ['2', '2', '2.0', '2'],
       ['3', '3', '3.0', '3'],
       [nan, nan, nan, '4']], dtype=object)
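When the empty strings themselves need to survive the read, `read_csv` has a `keep_default_na` parameter that switches off the default NaN markers. The sketch below assumes the same CSV content held in a string (`csv_text` is a stand-in for sample.csv) and compares the two behaviors.

```python
import io

import pandas as pd

# Stand-in for sample.csv, without the explanatory comment lines
csv_text = 'A,B,C,D\n1,"1",1.0,1\n2,"2",2.0,2\n3,"3",3.0,3\n,"",,4\n'

# Default: empty fields and quoted empty strings both become NaN
default_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# keep_default_na=False: nothing is treated as NaN, so "" stays ""
raw_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default_df.isna().sum())
print(raw_df.loc[3].tolist())
```

With this setting an unquoted empty field and a quoted empty string both come back as "", so the two are still indistinguishable after reading.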

Specifying float

All columns are converted to float64.


df = pd.read_csv("sample.csv", dtype=float)

print(df)

     A    B    C    D
0  1.0  1.0  1.0  1.0
1  2.0  2.0  2.0  2.0
2  3.0  3.0  3.0  3.0
3  NaN  NaN  NaN  4.0

print(df.dtypes)

A    float64
B    float64
C    float64
D    float64
dtype: object

print(df.values)

array([[ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [nan, nan, nan,  4.]])

Specifying int

Empty fields and empty strings are converted to np.nan, so the data cannot be read as int and an error is raised.


df = pd.read_csv("sample.csv", dtype=int)

ValueError: Integer column has NA values in column 0
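One possible way around this error is to request the nullable "Int64" dtype for the integer columns only. This was not tested in the environments listed at the top and assumes a recent pandas; `csv_text` is again a stand-in for sample.csv.

```python
import io

import pandas as pd

# Stand-in for sample.csv, without the explanatory comment lines
csv_text = 'A,B,C,D\n1,"1",1.0,1\n2,"2",2.0,2\n3,"3",3.0,3\n,"",,4\n'

# Per-column dtypes: only A and D are read as nullable integers
df = pd.read_csv(io.StringIO(csv_text), dtype={"A": "Int64", "D": "Int64"})

a_dtype = str(df["A"].dtype)
a_missing = df["A"].isna().tolist()

print(df.dtypes)
print(a_missing)
```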

Behavior of fillna and dropna

Behavior when fillna is applied to the data below.


df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)

With df.fillna('FILL'), the None and np.nan values are replaced, but the empty string is left as is.


print(df.fillna('FILL'))

      A  B     C  D
0     1  1     1  1
1     2  2     2  2
2     3  3     3  3
3  FILL     FILL  4

print(df.fillna('FILL').values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       ['FILL', '', 'FILL', 4]], dtype=object)

dropna behaves the same way: rows and columns containing np.nan or None are dropped, but the empty string is not treated as a missing value.


print(df.dropna(axis=1))

   B  D
0  1  1
1  2  2
2  3  3
3     4

print(df.dropna(axis=1).values)

array([['1', 1],
       ['2', 2],
       ['3', 3],
       ['', 4]], dtype=object)
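If empty strings should count as missing for fillna or dropna, one option is to turn them into np.nan first. A minimal sketch with a hypothetical Series `s` holding the same values as column B:

```python
import numpy as np
import pandas as pd

s = pd.Series(["1", "2", "3", ""], dtype=object)

# The empty string is not missing as far as pandas is concerned
missing_before = int(s.isna().sum())

# Replace "" with np.nan so fillna and dropna can see it
cleaned = s.replace("", np.nan)
missing_after = int(cleaned.isna().sum())

filled = cleaned.fillna("FILL")
dropped = cleaned.dropna()

print(missing_before, missing_after)
print(filled.tolist())
```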

Behavior of groupby

The DataFrame below is used for the checks.


df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)

When grouping by a column that contains None or np.nan, the rows with None or np.nan are ignored (dropped as missing).


print(df.groupby("A").max().reset_index())

   A  B  C  D
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3

print(df.groupby("A").max().reset_index().values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3]], dtype=object)

print(df.groupby("C").max().reset_index())

   C  A  B  D
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3

print(df.groupby("C").max().reset_index().values)

array([[1, 1, '1', 1],
       [2, 2, '2', 2],
       [3, 3, '3', 3]], dtype=object)

An empty string in the grouping column is not ignored.


print(df.groupby("B").max().reset_index())

   B    A    C  D
0     NaN  NaN  4
1  1  1.0  1.0  1
2  2  2.0  2.0  2
3  3  3.0  3.0  3

print(df.groupby("B").max().reset_index().values)

array([['', nan, nan, 4],
       ['1', 1.0, 1.0, 1],
       ['2', 2.0, 2.0, 2],
       ['3', 3.0, 3.0, 3]], dtype=object)
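Newer pandas releases (1.1 and later, which is newer than the versions tested here) accept a `dropna` argument on `groupby` that keeps the missing key as its own group. A sketch under that assumption, using a reduced two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, None], "D": [1, 2, 3, 4]})

# Default: the row whose key is missing is silently dropped
default_max = df.groupby("A")["D"].max()

# dropna=False (assumes pandas >= 1.1): the missing key forms a group
kept_max = df.groupby("A", dropna=False)["D"].max()

print(len(default_max), len(kept_max))
print(kept_max)
```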
