Expanding comma-separated values into rows per group with pandas

pandas

Posted at 2021-06-21

Introduction

user_id	created_at	editor
12345	2020-01-01	atom,vim,vi
23456	2020-01-02	vim,pycharm
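Given data like the above, you sometimes want to expand the editor column into one row per value for each user_id. The desired result looks like this: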

user_id	created_at	editor
12345	2020-01-01	atom
12345	2020-01-01	vim
12345	2020-01-01	vi
23456	2020-01-02	vim
23456	2020-01-02	pycharm

So how can this reshaping be done with pandas?

Reshaping

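There are several possible approaches; this article uses stack.
stack is useful when the DataFrame has many columns. When there are only a few, itertools.chain or numpy.repeat are sometimes used instead.
The resulting code is admittedly not very readable, though.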

>>> import pandas as pd
>>> df = pd.DataFrame({'user_id':[12345,23456],
...                   "created_at":['2020-01-01','2020-01-02'],
...                   "editor":["atom,vim,vi","vim,pycharm"]})
>>> df = df.set_index(['user_id','created_at']).apply(lambda x : x.str.split(',')).stack(
...     ).apply(pd.Series).stack().unstack(level=2).reset_index(level=[0,1])
>>> df
   user_id  created_at   editor
0    12345  2020-01-01     atom
1    12345  2020-01-01      vim
2    12345  2020-01-01       vi
0    23456  2020-01-02      vim
1    23456  2020-01-02  pycharm
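
For comparison, here is a minimal sketch of the itertools.chain and numpy.repeat alternative mentioned earlier, using the same sample data. The variable names (editors, counts, expanded, exploded) are illustrative and not from the original article. The last statement also shows DataFrame.explode, which is available in pandas 0.25 and later; it is a different technique from the stack approach above and is included only as a point of reference.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": [12345, 23456],
    "created_at": ["2020-01-01", "2020-01-02"],
    "editor": ["atom,vim,vi", "vim,pycharm"],
})

# Split each comma-separated string into a list of editors.
editors = df["editor"].str.split(",")

# Number of editors in each original row.
counts = editors.str.len().to_numpy()

expanded = pd.DataFrame({
    # Repeat each key value once per editor in its row.
    "user_id": np.repeat(df["user_id"].to_numpy(), counts),
    "created_at": np.repeat(df["created_at"].to_numpy(), counts),
    # Flatten the per-row lists into one long list.
    "editor": list(chain.from_iterable(editors)),
})

# Assumes pandas >= 0.25: explode turns each list element into its own row.
exploded = (
    df.assign(editor=df["editor"].str.split(","))
      .explode("editor")
      .reset_index(drop=True)
)
```

Both expanded and exploded end up with one row per editor. The numpy.repeat version requires every key column to be repeated by hand, which is why it tends to be practical only when there are few columns.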
