
[pandas] Extracting the partially duplicated rows of a DataFrame

Posted at 2018-02-25

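How to extract the rows of a DataFrame that are duplicated on only some of its columns.

Source data
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3', 'A3', 'A4'],
                   'col2': ['B1', 'B2', 'B2', 'B2', 'B2', 'B4'],
                   'col3': ['C1', 'C2', 'C3', 'C2', 'C3', 'C4']})
print(df)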

# col1 col2 col3
#   A1   B1   C1
#   A2   B2   C2
#   A1   B2   C3
#   A3   B2   C2
#   A3   B2   C3
#   A4   B4   C4

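Extract the rows whose col1 value is duplicated. With keep=False, every row of a duplicated group is marked, not just the second and later occurrences.

Extract by col1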
df1 = df[df.duplicated('col1',keep=False)]
print(df1)

# col1 col2 col3
#   A1   B1   C1
#   A1   B2   C3
#   A3   B2   C2
#   A3   B2   C3
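
As a rough illustration of why keep=False is used here, the sketch below compares the three values that the keep parameter of DataFrame.duplicated accepts. The variable names mask_all, mask_first and mask_last are made up for this example and are not part of the original post.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3', 'A3', 'A4'],
                   'col2': ['B1', 'B2', 'B2', 'B2', 'B2', 'B4'],
                   'col3': ['C1', 'C2', 'C3', 'C2', 'C3', 'C4']})

# keep=False marks every row that belongs to a duplicated group
mask_all = df.duplicated('col1', keep=False)
# keep='first' (the default) leaves the first occurrence unmarked
mask_first = df.duplicated('col1')
# keep='last' leaves the last occurrence unmarked
mask_last = df.duplicated('col1', keep='last')

print(mask_all.tolist())    # [True, False, True, True, True, False]
print(mask_first.tolist())  # [False, False, True, False, True, False]
print(mask_last.tolist())   # [True, False, False, True, False, False]
```

Only keep=False returns both members of each duplicated pair, which is what the extraction above relies on.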

Extract the rows whose combination of col2 and col3 values is duplicated.

Extract by col2 and col3
df2 = df[df.duplicated(['col2','col3'],keep=False)]
print(df2)

# col1 col2 col3
#   A2   B2   C2
#   A1   B2   C3
#   A3   B2   C2
#   A3   B2   C3

Sort the result to make it easier to read.

Sort df2
df3 = df2.sort_values(by=["col2", "col3"], ascending=True)
print(df3)

# col1 col2 col3
#   A2   B2   C2
#   A3   B2   C2
#   A1   B2   C3
#   A3   B2   C3

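Incidentally, a different approach gives the same result: select the columns first, then call duplicated on that subset.

Alternative approach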
df4 = df[df[['col2','col3']].duplicated(keep=False)]
df4 = df4.sort_values(by=["col2", "col3"], ascending=True)
print(df4)

# col1 col2 col3
#   A2   B2   C2
#   A3   B2   C2
#   A1   B2   C3
#   A3   B2   C3
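
One way to confirm that the two approaches agree is to compare their unsorted results with DataFrame.equals. This is only a small check on the sample data, and the variable names by_subset, by_selection and same are hypothetical ones chosen for this sketch.

```python
import pandas as pd

# Same sample data as in the article
df = pd.DataFrame({'col1': ['A1', 'A2', 'A1', 'A3', 'A3', 'A4'],
                   'col2': ['B1', 'B2', 'B2', 'B2', 'B2', 'B4'],
                   'col3': ['C1', 'C2', 'C3', 'C2', 'C3', 'C4']})

# Approach 1: pass the column names as the subset argument
by_subset = df[df.duplicated(['col2', 'col3'], keep=False)]
# Approach 2: select the columns first, then call duplicated on them
by_selection = df[df[['col2', 'col3']].duplicated(keep=False)]

# equals compares values, index and column labels
same = by_subset.equals(by_selection)
print(same)  # True
```

Both expressions build the same boolean mask, so the choice between them is mostly a matter of readability.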