DataFrameの重複行の削除
DataFarme
を結合して重複があった場合の削除方法
import pandas as pd
df1 = pd.DataFrame([['2022/1/6', '3000', '2400', '240'],
['2022/1/7', '1000', '800', '80'],
['2022/1/8', '3000', '2400', '240'],
['2022/1/9', '1500', '1200', '120']],
columns=['date', 'value', 'cost', 'tax'])
df1['date'] = pd.to_datetime(df1['date']).dt.date
df2 = pd.DataFrame([['2022/1/8', '3000', '2400', '240'],
['2022/1/9', '1500', '1200', '120']],
columns=['date', 'value', 'cost', 'tax'])
df2['date'] = pd.to_datetime(df2['date']).dt.date
df3 = pd.DataFrame([['2022/1/9', '1500', '1200', '120', '15'],
['2022/1/10', '500', '400', '40', '5'],
['2022/1/11', '1000', '800', '80', '10']],
columns=['date', 'value', 'cost', 'tax', 'tax2'])
df3['date'] = pd.to_datetime(df3['date']).dt.date
df = pd.concat([df1, df2, df3], ignore_index=True)
df
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
2 | 2022-01-08 | 3000 | 2400 | 240 | NaN |
3 | 2022-01-09 | 1500 | 1200 | 120 | NaN |
4 | 2022-01-08 | 3000 | 2400 | 240 | NaN |
5 | 2022-01-09 | 1500 | 1200 | 120 | NaN |
6 | 2022-01-09 | 1500 | 1200 | 120 | 15 |
7 | 2022-01-10 | 500 | 400 | 40 | 5 |
8 | 2022-01-11 | 1000 | 800 | 80 | 10 |
index「2, 4」と「3, 5, 6」にdateの重複があります。dateの重複を判定したいのでsubset='date'
を引数にします。
subset
の引数がない場合は行のすべてが重複す場合に削除となります。
また、最初を残す(default)か、最後を残すかを選択できます。
最初を残す( keep='first'
)と、index「2」と「3」が残ります。
df.drop_duplicates(subset='date', keep='first')
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
2 | 2022-01-08 | 3000 | 2400 | 240 | NaN |
3 | 2022-01-09 | 1500 | 1200 | 120 | NaN |
7 | 2022-01-10 | 500 | 400 | 40 | 5 |
8 | 2022-01-11 | 1000 | 800 | 80 | 10 |
最後を残す( keep='last'
)と、index「4」と「6」が残ります。
df.drop_duplicates(subset='date', keep='last')
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
4 | 2022-01-08 | 3000 | 2400 | 240 | NaN |
6 | 2022-01-09 | 1500 | 1200 | 120 | 15 |
7 | 2022-01-10 | 500 | 400 | 40 | 5 |
8 | 2022-01-11 | 1000 | 800 | 80 | 10 |
複数のcolumnで判定した場合はsubset
に複数のcloumnをセットする
df.drop_duplicates(subset=['date', 'tax2'])
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
2 | 2022-01-08 | 3000 | 2400 | 240 | NaN |
3 | 2022-01-09 | 1500 | 1200 | 120 | NaN |
6 | 2022-01-09 | 1500 | 1200 | 120 | 15 |
7 | 2022-01-10 | 500 | 400 | 40 | 5 |
8 | 2022-01-11 | 1000 | 800 | 80 | 10 |
重複している行すべてを削除する場合はkeep=False
df.drop_duplicates(subset='date', keep=False)
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
7 | 2022-01-10 | 500 | 400 | 40 | 5 |
8 | 2022-01-11 | 1000 | 800 | 80 | 10 |
indexの振り直しもできます。ignore_index=True
df.drop_duplicates(subset='date', keep=False, ignore_index=True)
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
2 | 2022-01-10 | 500 | 400 | 40 | 5 |
3 | 2022-01-11 | 1000 | 800 | 80 | 10 |
重複行を判定する場合はduplicated
を使います。引数のsubset
、keep
の使い方は同じです。
df.duplicated(subset='date', keep=False)
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 False
8 False
dtype: bool
重複行を判定した結果を使っての削除
df[~df.duplicated(subset='date', keep=False)]
date | value | cost | tax | tax2 | |
---|---|---|---|---|---|
0 | 2022-01-06 | 3000 | 2400 | 240 | NaN |
1 | 2022-01-07 | 1000 | 800 | 80 | NaN |
7 | 2022-01-10 | 500 | 400 | 40 | 5 |
8 | 2022-01-11 | 1000 | 800 | 80 | 10 |