Per-category one-hot encoding in pandas

Posted at 2018-11-26

Abstract

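Extracting the maximum and minimum value per category in pandas is not difficult.
This time I wanted a script that labels the rows holding those maximum and minimum values, and I was stuck on it for quite a while, so I am leaving a note here.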
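
For reference, the "not difficult" part (pulling out the per-category minimum and maximum) might look like the following minimal sketch. The DataFrame here is a hypothetical toy table built in memory, not data from the original post.

```python
import pandas as pd

# Hypothetical toy data: two categories with three rows each.
df = pd.DataFrame({
    "category_id": [1, 1, 1, 2, 2, 2],
    "id": [1, 2, 3, 4, 5, 6],
    "incremental": [1, 2, 3, 1, 2, 3],
})

# One row per category, with the smallest and largest incremental value.
summary = df.groupby("category_id")["incremental"].agg(["min", "max"])
print(summary)
```

This returns one row per category, so it does not say which rows of the original table hold those values. That is the part the rest of this note is about.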

Situation

idx	category_id	id	incremental
0	1	1	1
1	1	2	2
2	1	3	3
3	2	4	1
4	2	5	2
5	2	6	3
...	...	...	...

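Consider a table like the one above.
For this table, the task is:

Flag the rows that hold the maximum and minimum value within each category.

In the end, the table should look like this.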

idx	category_id	id	incremental	is_first	is_last
0	1	1	1	1	0
1	1	2	2	0	0
2	1	3	3	0	1
3	2	4	1	1	0
4	2	5	2	0	0
5	2	6	3	0	1
...	...	...	...	...	...

Script

The script assumes the data is read from a CSV file.

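import numpy as np
import pandas as pd

df = pd.read_csv('file_name')

# idxmin()/idxmax() return, per category, the index label of the row with the
# smallest/largest incremental value; rows with those labels are flagged with 1.
df['is_first'] = np.where(df.index.isin(df.groupby('category_id').incremental.idxmin()), 1, 0)
df['is_last']  = np.where(df.index.isin(df.groupby('category_id').incremental.idxmax()), 1, 0)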

This still works, at least for now, when more categories are added.
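
To try the approach without a CSV file, here is a self-contained sketch that builds the sample table from the Situation section in memory. The DataFrame literal and the names `grouped`, `first_rows` and `last_rows` are my own additions for illustration. The approach relies on the index labels being unique, which is the case for the default index.

```python
import numpy as np
import pandas as pd

# In-memory stand-in for the CSV file (assumption: same columns as the sample table).
df = pd.DataFrame({
    "category_id": [1, 1, 1, 2, 2, 2],
    "id": [1, 2, 3, 4, 5, 6],
    "incremental": [1, 2, 3, 1, 2, 3],
})

grouped = df.groupby("category_id")["incremental"]
first_rows = grouped.idxmin()  # index label of the minimum row in each category
last_rows = grouped.idxmax()   # index label of the maximum row in each category

# Flag a row with 1 when its index label is one of the per-category extremes.
df["is_first"] = np.where(df.index.isin(first_rows), 1, 0)
df["is_last"] = np.where(df.index.isin(last_rows), 1, 0)
print(df)
```

Note that `idxmin()` and `idxmax()` return only the first matching row when a category has ties, so at most one row per category gets each flag.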

TODO

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The script raises the warning above. I have not found a fix yet, so there is room for improvement.
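
One common cause of this warning, which may or may not be what happened here, is that `df` was itself produced by filtering or slicing another DataFrame before the new columns were assigned. The sketch below assumes that situation. The `raw` DataFrame and the filter condition are hypothetical and do not appear in the script above. Taking an explicit `.copy()` of the filtered result makes `df` an independent DataFrame, so adding columns to it should no longer trigger the warning.

```python
import numpy as np
import pandas as pd

# Hypothetical source table with one extra category that gets filtered out.
raw = pd.DataFrame({
    "category_id": [1, 1, 1, 2, 2, 2, 3],
    "id": [1, 2, 3, 4, 5, 6, 7],
    "incremental": [1, 2, 3, 1, 2, 3, 1],
})

# Assumed filter step; .copy() detaches the result from raw.
df = raw[raw["category_id"] != 3].copy()

grouped = df.groupby("category_id")["incremental"]
df["is_first"] = np.where(df.index.isin(grouped.idxmin()), 1, 0)
df["is_last"] = np.where(df.index.isin(grouped.idxmax()), 1, 0)
print(df)
```

If the warning still appears with a DataFrame that comes straight from `pd.read_csv`, the cause is probably elsewhere and this sketch does not apply.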
