
[Python/pandas] Counting missing values in a DataFrame

Last updated at 2018-05-05. Posted at 2018-03-17.

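To decide on preprocessing you need to look through the data to some extent, and checking for missing values is something you end up doing in almost every case.

I wanted to do it neatly in a one-liner, so I'm writing it up here.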

Addendum (2018.05.05)

After looking into it later, I found that this is cleaner.

df_train.isnull().sum()

Use this one ↑ to count missing values.
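To see what that one-liner returns without downloading anything, here is a minimal sketch on a made-up DataFrame. The name `toy` and its values are assumptions for illustration only, not part of the wine-reviews data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the wine-reviews dataset).
toy = pd.DataFrame({
    "country": ["US", "Spain", None, "France"],
    "price": [235.0, np.nan, np.nan, 66.0],
    "points": [96, 96, 95, 95],
})

# isnull() returns a boolean DataFrame of the same shape;
# sum() then counts the True values (the missing cells) per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

The result is a Series indexed by column name, so a single column's count can be read with something like `missing_counts["price"]`.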

The dataset used here is wine-reviews.

import numpy as np
import pandas as pd

# Data source:
#   https://www.kaggle.com/zynicide/wine-reviews

df_train = pd.read_csv("winemag-data_first150k.csv", index_col=0)
df_train.head()

	country	description	designation	points	price	province	region_1	region_2	variety	winery
0	US	...	Martha's Vineyard	96	235.0	California	Napa Valley	Napa	Cabernet Sauvignon	Heitz
1	Spain	...	Carodorum Selección Especial Reserva	96	110.0	Northern Spain	Toro		Tinta de Toro	Bodega Carmen Rodríguez
2	US	...	Special Selected Late Harvest	96	90.0	California	Knights Valley	Sonoma	Sauvignon Blanc	Macauley
3	US	...	Reserve	96	65.0	Oregon	Willamette Valley	Willamette Valley	Pinot Noir	Ponzi
4	France	...	La Brûlade	95	66.0	Provence	Bandol		Provence red blend	Domaine de la Bégude

Note: "description" holds long free text, so it is replaced with "..." here.


# Count non-missing (False) and missing (True) cells per column
df_train.isnull().apply(lambda col: col.value_counts(), axis=0).fillna(0).astype(int)

	country	description	designation	points	price	province	region_1	region_2	variety	winery
False	150925	150930	105195	150930	137235	150925	125870	60953	150930	150930
True	5	0	45735	0	13695	5	25060	89977	0	0

# Same counts, divided by each column's total to get proportions
df_train.isnull().apply(lambda col: col.value_counts(), axis=0).fillna(0).astype(float).apply(lambda col: col/col.sum(), axis=0)

	country	description	designation	points	price	province	region_1	region_2	variety	winery
False	0.9999668720598953	1.0	0.6969787318624527	1.0	0.9092625720532698	0.9999668720598953	0.8339627641953223	0.4038494666401643	1.0	1.0
True	3.312794010468429e-05	0.0	0.3030212681375472	0.0	0.09073742794673027	3.312794010468429e-05	0.16603723580467766	0.5961505333598357	0.0	0.0
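If only the proportion of missing values (the True row above) is needed, taking the mean of the boolean mask should give the same numbers in less code. This is a minimal sketch on made-up data; `toy` and its values are assumptions for illustration, not the wine-reviews dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the wine-reviews dataset).
toy = pd.DataFrame({
    "region_1": ["Napa Valley", "Toro", None, "Bandol"],
    "price": [235.0, np.nan, np.nan, 66.0],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the fraction of missing cells in that column.
missing_ratio = toy.isnull().mean()

# Equivalent: the missing count divided by the number of rows.
missing_ratio_alt = toy.isnull().sum() / len(toy)

print(missing_ratio)
```

Unlike the value_counts version, this gives only the missing side, but the non-missing proportion is simply `1 - missing_ratio`.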

So isnull works on a DataFrame too. I had always assumed it was a Series-only method.
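As a quick check of that point, the sketch below calls isnull on both a DataFrame and a Series. The data is made up for illustration (the name `toy` is an assumption). As far as I know, `isna` is the same method under another name, so the two should give identical results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the wine-reviews dataset).
toy = pd.DataFrame({
    "price": [235.0, np.nan, 90.0],
    "region_2": ["Napa", None, "Sonoma"],
})

frame_mask = toy.isnull()            # called on a DataFrame -> boolean DataFrame
series_mask = toy["price"].isnull()  # called on a Series -> boolean Series

print(type(frame_mask).__name__)
print(type(series_mask).__name__)

# isna() is expected to return the same mask as isnull()
print(frame_mask.equals(toy.isna()))
```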