
Counting consecutive values with pandas

Posted at 2016-05-11, last updated at 2016-05-11

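Goal

Find out how many times in a row the same value repeats within a data series.

Approach

A plain for loop would do the job, but I want to write it more elegantly.

Solution

Make full use of pandas' shift, cumsum, and groupby.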
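To see how the three calls fit together, here is a minimal sketch on a tiny hand-made series. The variable names (`changed`, `run_id`, `run_length`) are illustrative only and do not appear in the article's implementation, and `transform('size')` is used here for brevity in place of the article's count-and-join step.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 2, 1])

# True wherever a row differs from the previous row
# (the first row is compared with NaN, so it always counts as a change)
changed = s != s.shift()

# The running total of change flags gives every run its own id
run_id = changed.cumsum()

# Size of each run, broadcast back onto every row of that run
run_length = s.groupby(run_id).transform('size')

print(run_id.tolist())      # [1, 1, 2, 2, 2, 3]
print(run_length.tolist())  # [2, 2, 3, 3, 3, 1]
```

Grouping on the run id rather than on the value itself is what keeps the two separate runs of 1 apart.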

Implementation

import numpy as np
import pandas as pd
import random

def continue_count(df):
    # Work on a copy so the caller's DataFrame does not gain helper columns
    df = df.copy()
    # NaN keys are ignored by groupby, so replace them with a placeholder first
    df['new_value'] = df['value'].fillna('-')
    # Add 1 to the running total when a row differs from the previous row; leave it unchanged when it is the same
    df['changepoint_cumsum'] = (df['new_value'] != df['new_value'].shift()).cumsum()
    # Group by new_value and changepoint_cumsum
    df_group = df.groupby(['new_value', 'changepoint_cumsum'])
    mi_df = df.set_index(['new_value', 'changepoint_cumsum'])
    # Count the rows in each group and join the counts back onto the original data
    mi_df['continue_count'] = df_group['new_value'].count()
    # Restore the index
    df = mi_df.reset_index()
    return df.drop(['new_value', 'changepoint_cumsum'], axis=1)

if __name__ == '__main__':
    ASSET = [-30, -1, 1, 5, 100, np.nan]
    # Build a dataset by picking one element of ASSET at random, 5000 times
    df = pd.DataFrame({'value': [random.choice(ASSET) for i in range(5000)]})
    print(continue_count(df))
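As a variation on the implementation above, the same counts can probably be obtained without the placeholder value and the MultiIndex round trip. The sketch below is not the article's method: `continue_count_transform` is a hypothetical name, and it assumes that consecutive NaN rows should be treated as one run, which matches what the `fillna('-')` step achieves.

```python
import numpy as np
import pandas as pd

def continue_count_transform(df):
    """Hypothetical variant: run lengths via groupby(...).transform('size')."""
    value = df['value']
    prev = value.shift()
    # A row starts a new run when it differs from the previous row.
    # NaN != NaN is True, so consecutive NaNs are explicitly excluded
    # and end up in the same run.
    changed = (value != prev) & ~(value.isna() & prev.isna())
    run_id = changed.cumsum()
    # 'size' counts every row in the group, including NaN rows
    counts = value.groupby(run_id).transform('size')
    # assign returns a new DataFrame, so the input is left untouched
    return df.assign(continue_count=counts)

sample = pd.DataFrame({'value': [1.0, 1.0, np.nan, np.nan, 5.0]})
result = continue_count_transform(sample)
print(result['continue_count'].tolist())  # [2, 2, 2, 2, 1]
```

Because the grouping key is the integer run id, no NaN ever reaches the group keys, so nothing is silently dropped by groupby.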

The output looks like this:

      value  continue_count
0       1.0               1
1     -30.0               2
2     -30.0               2
3     100.0               1
4       5.0               1
5     -30.0               1
6       NaN               1
7     -30.0               1
8       5.0               2
9       5.0               2
10    100.0               2
11    100.0               2
...

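So, to find the rows where the same value occurs 5 or more times in a row, assign the returned DataFrame back to df and filter on continue_count:

df = continue_count(df)
df[df['continue_count'] >= 5]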
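Since the article's dataset is random, a small fixed example may make the filter easier to check by hand. This is only a sketch: the data is made up, and the count is computed inline with `transform('size')` instead of calling `continue_count`, so that the block runs on its own.

```python
import pandas as pd

# Made-up data: a run of five 7s, a run of two 3s, then a single 7
df = pd.DataFrame({'value': [7, 7, 7, 7, 7, 3, 3, 7]})

# Run id: increases by 1 each time the value changes
run_id = (df['value'] != df['value'].shift()).cumsum()
df['continue_count'] = df.groupby(run_id)['value'].transform('size')

# Keep only rows that belong to a run of 5 or more
long_runs = df[df['continue_count'] >= 5]
print(long_runs.index.tolist())  # [0, 1, 2, 3, 4]
```

The trailing 7 at index 7 is excluded because it forms its own run of length 1.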
