0
1

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 1 year has passed since last update.

Python DataFrameの重複行の削除

Last updated at Posted at 2022-02-05

DataFrameの重複行の削除

DataFarmeを結合して重複があった場合の削除方法

import pandas as pd
df1 = pd.DataFrame([['2022/1/6', '3000', '2400', '240'],
                    ['2022/1/7', '1000', '800', '80'],
                    ['2022/1/8', '3000', '2400', '240'],
                    ['2022/1/9', '1500', '1200', '120']],
                   columns=['date', 'value', 'cost', 'tax'])
df1['date'] = pd.to_datetime(df1['date']).dt.date
df2 = pd.DataFrame([['2022/1/8', '3000', '2400', '240'],
                    ['2022/1/9', '1500', '1200', '120']],
                   columns=['date', 'value', 'cost', 'tax'])
df2['date'] = pd.to_datetime(df2['date']).dt.date
df3 = pd.DataFrame([['2022/1/9', '1500', '1200', '120', '15'],
                    ['2022/1/10', '500', '400', '40', '5'],
                    ['2022/1/11', '1000', '800', '80', '10']],
                   columns=['date', 'value', 'cost', 'tax', 'tax2'])
df3['date'] = pd.to_datetime(df3['date']).dt.date
df = pd.concat([df1, df2, df3], ignore_index=True)
df
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
2 2022-01-08 3000 2400 240 NaN
3 2022-01-09 1500 1200 120 NaN
4 2022-01-08 3000 2400 240 NaN
5 2022-01-09 1500 1200 120 NaN
6 2022-01-09 1500 1200 120 15
7 2022-01-10 500 400 40 5
8 2022-01-11 1000 800 80 10

index「2, 4」と「3, 5, 6」にdateの重複があります。dateの重複を判定したいのでsubset='date'を引数にします。
subsetの引数がない場合は行のすべてが重複す場合に削除となります。
また、最初を残す(default)か、最後を残すかを選択できます。
最初を残す( keep='first')と、index「2」と「3」が残ります。

df.drop_duplicates(subset='date', keep='first')
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
2 2022-01-08 3000 2400 240 NaN
3 2022-01-09 1500 1200 120 NaN
7 2022-01-10 500 400 40 5
8 2022-01-11 1000 800 80 10

最後を残す( keep='last')と、index「4」と「6」が残ります。

df.drop_duplicates(subset='date', keep='last')
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
4 2022-01-08 3000 2400 240 NaN
6 2022-01-09 1500 1200 120 15
7 2022-01-10 500 400 40 5
8 2022-01-11 1000 800 80 10

複数のcolumnで判定した場合はsubsetに複数のcloumnをセットする

df.drop_duplicates(subset=['date', 'tax2'])
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
2 2022-01-08 3000 2400 240 NaN
3 2022-01-09 1500 1200 120 NaN
6 2022-01-09 1500 1200 120 15
7 2022-01-10 500 400 40 5
8 2022-01-11 1000 800 80 10

重複している行すべてを削除する場合はkeep=False

df.drop_duplicates(subset='date', keep=False)
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
7 2022-01-10 500 400 40 5
8 2022-01-11 1000 800 80 10

indexの振り直しもできます。ignore_index=True

df.drop_duplicates(subset='date', keep=False, ignore_index=True)
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
2 2022-01-10 500 400 40 5
3 2022-01-11 1000 800 80 10

重複行を判定する場合はduplicatedを使います。引数のsubsetkeepの使い方は同じです。

df.duplicated(subset='date', keep=False)
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7    False
8    False
dtype: bool

重複行を判定した結果を使っての削除

df[~df.duplicated(subset='date', keep=False)]
date value cost tax tax2
0 2022-01-06 3000 2400 240 NaN
1 2022-01-07 1000 800 80 NaN
7 2022-01-10 500 400 40 5
8 2022-01-11 1000 800 80 10
0
1
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
1

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?