
Things to watch out for when splitting the rows of a DataFrame on a column that contains NULL (missing values)

Posted at 2019-04-22, last updated at 2019-04-22

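In data-analysis preprocessing and machine learning, you often need to split the rows of a pandas.DataFrame by some condition.
When the column used in the condition contains no NULL (missing values), nothing goes wrong. When it does contain NULL, however, a careless split is no longer MECE (mutually exclusive, collectively exhaustive), so care is needed.
As anyone who works with SQL knows, you have to account for three outcomes: true, false, and unknown.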

TL;DR

Fill in (impute) the NULL values before splitting the DataFrame.
If you do not fill them in, first carve out the rows that contain NULL.
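
As a minimal sketch of the first option, assuming a sentinel value is acceptable for your analysis (the value -1 and the names `df`, `filled`, `over5`, and `under5` are illustrative choices, not part of the original article), filling the missing values first means every row lands in exactly one group:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in col1
df = pd.DataFrame({'col1': [np.nan, 2, 3, 4, 5, 6]})

# Fill NaN with a sentinel (assumed here to be -1) before splitting
filled = df.fillna({'col1': -1})

over5 = filled[filled['col1'] >= 5]
under5 = filled[filled['col1'] < 5]

# The two groups together now account for every row
print(len(over5), len(under5), len(df))  # 2 4 6
```

Whether a sentinel, the mean, or some other fill value is appropriate depends on the data; the point is only that the split happens after the NaN values are gone.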

Creating the test data

Create the following simple test data.

import pandas as pd
import numpy as np
# Create a DataFrame with an integer column (col1) holding 1 to 10
x = pd.DataFrame(
    {
        'col1':[1,2,3,4,5,6,7,8,9,10]
    }
)
print(x)

output:

   col1
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10

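When the column contains no NULL

First, let's split the rows by a condition when the column contains no NULL.
We split the DataFrame on col1 into the rows where col1 is 5 or greater and the rows where it is less than 5.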

# Split the DataFrame using col1 = 5 as the threshold
x_over5 = x[x['col1']>=5]
x_under5 = x[x['col1']<5]
print('col1 >= 5')
print(x_over5)
print('+'*20)
print('col1 < 5')
print(x_under5)
print('+'*20)
print('len(x_over5):', len(x_over5))
print('len(x_under5):', len(x_under5))

output:

col1 >= 5
   col1
4     5
5     6
6     7
7     8
8     9
9    10
++++++++++++++++++++
col1 < 5
   col1
0     1
1     2
2     3
3     4
++++++++++++++++++++
len(x_over5): 6
len(x_under5): 4

In this case the split is MECE: every row belongs to exactly one of the two groups.
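
Such a property can also be checked programmatically rather than by eye. The following is a rough sketch; `is_mece` is a hypothetical helper introduced here for illustration, and it assumes the DataFrame's index labels are unique:

```python
import pandas as pd

def is_mece(whole, parts):
    """Return True if `parts` split `whole` with no overlap and no gap.

    Assumes the index labels of `whole` are unique.
    """
    combined = pd.concat(parts)
    # Mutually exclusive: no index label appears in more than one part
    no_overlap = not combined.index.duplicated().any()
    # Collectively exhaustive: every index label of `whole` is covered
    covers_all = set(combined.index) == set(whole.index)
    return no_overlap and covers_all

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

good_parts = [df[df['col1'] >= 5], df[df['col1'] < 5]]
print(is_mece(df, good_parts))         # True

# Row with col1 == 5 appears in both parts, so this split overlaps
overlapping_parts = [df[df['col1'] >= 5], df[df['col1'] <= 5]]
print(is_mece(df, overlapping_parts))  # False
```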

When the column contains NULL

Next, let's split the rows by a condition when the column contains NULL.
Assign NaN to the first row of x.

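# Set col1 of the first row to NaN
# (cast to float first, because an int64 column cannot hold NaN)
x['col1'] = x['col1'].astype(float)
x.loc[0, 'col1'] = np.nan
print(x)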

output:

   col1
0   NaN
1   2.0
2   3.0
3   4.0
4   5.0
5   6.0
6   7.0
7   8.0
8   9.0
9  10.0

As before, we split the DataFrame on col1 into the rows where col1 is 5 or greater and the rows where it is less than 5.

# Split the DataFrame using col1 = 5 as the threshold
x_over5 = x[x['col1']>=5]
x_under5 = x[x['col1']<5]
print('col1 >= 5')
print(x_over5)
print('+'*20)
print('col1 < 5')
print(x_under5)
print('+'*20)
print('len(x_over5):', len(x_over5))
print('len(x_under5):', len(x_under5))

output:

col1 >= 5
   col1
4   5.0
5   6.0
6   7.0
7   8.0
8   9.0
9  10.0
++++++++++++++++++++
col1 < 5
   col1
1   2.0
2   3.0
3   4.0
++++++++++++++++++++
len(x_over5): 6
len(x_under5): 3

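NaN marks a missing value and cannot be ordered against a number, so row 0 belongs to neither x_over5 nor x_under5: the two groups now hold 9 of the 10 rows.
Let's look at what the comparison operators return for NaN.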

# Comparison results for the NaN row
print((x['col1']>=5)[0], (x['col1']<5)[0])

output:

False False
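
This is not specific to pandas. As a small illustration using only the standard library (the variable name `nan` is just for this sketch), a plain Python float NaN behaves the same way under IEEE 754 comparison rules:

```python
import math

nan = float('nan')

# Every ordering comparison against NaN is False
print(nan >= 5, nan < 5)        # False False

# NaN is not even equal to itself
print(nan == nan, nan != nan)   # False True

# A dedicated check is needed instead of a comparison
print(math.isnan(nan))          # True
```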

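Applying a numeric comparison operator to NaN yields False.
That is why the row can belong to neither x_over5 nor x_under5.
This is where the three outcomes familiar from SQL (true, false, and unknown) come into play: a comparison against NaN in a float column returns False rather than unknown, so the NULL rows have to be handled as a case of their own.
For NaN, you need to use isnull() and notnull().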

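# Extract the rows where col1 is NULL
x_null = x[x['col1'].isnull()]
# Extract the rows where col1 is not NULL and is 5 or greater
x_over5 = x[(x['col1']>=5) & (x['col1'].notnull())]
# Extract the rows where col1 is not NULL and is less than 5
x_under5 = x[(x['col1']<5) & (x['col1'].notnull())]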
print('col1 is  null')
print(x_null)
print('+'*20)
print('col1 >= 5')
print(x_over5)
print('+'*20)
print('col1 < 5')
print(x_under5)
print('+'*20)
print('len(x_null):', len(x_null))
print('len(x_over5):', len(x_over5))
print('len(x_under5):', len(x_under5))

output:

col1 is  null
   col1
0   NaN
++++++++++++++++++++
col1 >= 5
   col1
4   5.0
5   6.0
6   7.0
7   8.0
8   9.0
9  10.0
++++++++++++++++++++
col1 < 5
   col1
1   2.0
2   3.0
3   4.0
++++++++++++++++++++
len(x_null): 1
len(x_over5): 6
len(x_under5): 3

With the NULL rows in a group of their own, the DataFrame is now split in a MECE way: 1 + 6 + 3 = 10 rows.
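
If this pattern comes up repeatedly, it could be wrapped in a small function. The sketch below is one possible shape, not the article's own code; `split_by_threshold` and the keys `'null'`, `'ge'`, and `'lt'` are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd

def split_by_threshold(df, col, threshold):
    """Split df into NULL, >= threshold, and < threshold parts on `col`."""
    is_null = df[col].isnull()
    parts = {
        'null': df[is_null],
        'ge': df[~is_null & (df[col] >= threshold)],
        'lt': df[~is_null & (df[col] < threshold)],
    }
    # Sanity check: the three parts together account for every row
    assert sum(len(p) for p in parts.values()) == len(df)
    return parts

df = pd.DataFrame({'col1': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
parts = split_by_threshold(df, 'col1', 5)
print({name: len(part) for name, part in parts.items()})
# {'null': 1, 'ge': 6, 'lt': 3}
```

Keeping the size check inside the function means a forgotten NULL case fails loudly instead of silently dropping rows.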

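Summary

This was a very simple example, but in real data analysis and machine learning you end up handling far more complex DataFrames.
Handling NULL may be the most basic of basics, yet it is easy to forget while implementing complicated processing.
Writing this reminded me to keep it in the back of my mind.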
