Building a table diff-check function with pandas

Last updated at 2023-06-20, posted at 2023-06-20

Creating the data

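import pandas as pd

# file_name: path to the source CSV
df = pd.read_csv(file_name, encoding="utf-8")
# Take the first 11 rows twice, as independent copies
df1 = df[0:11].copy()
df2 = df[0:11].copy()

# Introduce a difference
df2.loc[10, "name"] = "John Deo"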

df1

	id	name	class	mark	gender
0	1	John Deo	Four	75
1	2	Max Ruin	Three	85
2	3	Arnold	Three	55
3	4	Krish Star	Four	60
4	5	John Mike	Four	60
5	6	Alex John	Four	55
6	7	My John Rob	Fifth	78
8	9	Tes Qry	Six	78
9	10	Big John	Four	55
10	11	Ronald	Six	89

df2

	id	name	class	mark	gender
0	1	John Deo	Four	75
1	2	Max Ruin	Three	85
2	3	Arnold	Three	55
3	4	Krish Star	Four	60
4	5	John Mike	Four	60
5	6	Alex John	Four	55
6	7	My John Rob	Fifth	78
8	9	Tes Qry	Six	78
9	10	Big John	Four	55
10	11	John Deo	Six	89

In df2, name at index 10 is changed to John Deo, which creates the difference.

We now compare these two tables and check for differences.
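For readers who do not have the original CSV, the setup can be reproduced inline. This is a minimal sketch that assumes only a handful of the rows shown in the tables above; the `rows` list and the choice of rows are illustrative, and the gender column is left out.

```python
import pandas as pd

# Illustrative subset of the rows shown above (the original CSV is not included here).
rows = [
    {"id": 1, "name": "John Deo", "class": "Four", "mark": 75},
    {"id": 2, "name": "Max Ruin", "class": "Three", "mark": 85},
    {"id": 3, "name": "Arnold", "class": "Three", "mark": 55},
    {"id": 11, "name": "Ronald", "class": "Six", "mark": 89},
]
df = pd.DataFrame(rows, index=[0, 1, 2, 10])

# Two independent copies, so editing one leaves the other untouched.
df1 = df.copy()
df2 = df.copy()

# Introduce a single difference at index label 10.
df2.loc[10, "name"] = "John Deo"

print(df1.equals(df2))  # False
```

Using `.copy()` matters here: without it, both names could refer to the same underlying data and the edit might show up in both frames.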

Diff-check function


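from typing import Union

# Python 3.10+ can write the return type as a union with "|":
# def diff_check(df1: pd.DataFrame, df2: pd.DataFrame) -> bool | pd.DataFrame:

# Python 3.9 and earlier
def diff_check(df1: pd.DataFrame, df2: pd.DataFrame) -> Union[bool, pd.DataFrame]:
  # Keep only the rows of df1 whose id also appears in df2
  df_diff = df1[df1["id"].isin(df2["id"])]
  if df_diff.equals(df2):
    return True
  else:
    # compare() needs identically labeled frames, so compare the filtered rows
    df_dup = df_diff.compare(df2)
    return df_dup


diff_check(df1, df2)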

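If the two tables match, the function returns True.
If they do not match, it returns a DataFrame containing the mismatched values.

If you are on Python 3.10 or later, consider using the `bool | pd.DataFrame` form shown above for the type hint.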
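Because the function returns either a bool or a DataFrame, the caller has to branch on the type of the result. The following is a minimal sketch of that pattern: `report` is a hypothetical helper, the frames `a` and `b` are illustrative, and `diff_check` is restated in condensed form so the block runs by itself.

```python
from typing import Union

import pandas as pd


def diff_check(df1: pd.DataFrame, df2: pd.DataFrame) -> Union[bool, pd.DataFrame]:
    # Condensed restatement of the function above.
    df_diff = df1[df1["id"].isin(df2["id"])]
    if df_diff.equals(df2):
        return True
    return df_diff.compare(df2)


def report(result: Union[bool, pd.DataFrame]) -> str:
    # Hypothetical helper: turn either return type into a short message.
    if isinstance(result, pd.DataFrame):
        return f"{len(result)} row(s) differ"
    return "tables match"


a = pd.DataFrame({"id": [1, 2], "name": ["Arnold", "Ronald"]})
b = a.copy()
b.loc[1, "name"] = "John Deo"

print(report(diff_check(a, a.copy())))  # tables match
print(report(diff_check(a, b)))         # 1 row(s) differ
```

Checking with `isinstance` is safer than `if result:`, since evaluating a DataFrame in a boolean context raises a ValueError.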

Diff result

	name
	self	other
10	Ronald	John Deo

This shows that the name column differs at index 10: Ronald in df1 (self) versus John Deo in df2 (other).
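The result of `compare` has two-level column labels, which is why the output shows `self` and `other` under `name`. Here is a short sketch of how those values could be read programmatically, using small illustrative frames (`left` and `right` are assumed names, not part of the original data).

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["Arnold", "Ronald"]})
right = left.copy()
right.loc[1, "name"] = "John Deo"

diff = left.compare(right)

# Columns are a two-level index: (column name, "self" or "other").
print(diff.columns.tolist())           # [('name', 'self'), ('name', 'other')]
print(diff.loc[1, ("name", "self")])   # Ronald
print(diff.loc[1, ("name", "other")])  # John Deo

# keep_shape=True keeps every row and column, with NaN where values agree.
full = left.compare(right, keep_shape=True)
print(full.shape)  # (2, 4)
```

By default only the rows and columns that contain a difference are kept, so an unchanged column such as id does not appear in `diff`.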

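By the way

# You can also extract the rows of df1 whose id also appears in df2
# (use "name" instead of "id" to match on name)
df_diff = df1[df1["id"].isin(df2["id"])]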
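The same `isin` filter can be negated or pointed at another column. This is a small sketch with illustrative frames whose ids only partly overlap; the variable names `shared_ids`, `only_in_df1`, and `shared_names` are assumptions made for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["John Deo", "Max Ruin", "Arnold"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "name": ["Max Ruin", "John Deo", "Krish Star"]})

# Rows of df1 whose id also appears in df2.
shared_ids = df1[df1["id"].isin(df2["id"])]
print(shared_ids["id"].tolist())  # [2, 3]

# Rows of df1 whose id is missing from df2 (note the ~ negation).
only_in_df1 = df1[~df1["id"].isin(df2["id"])]
print(only_in_df1["id"].tolist())  # [1]

# The same filter on name instead of id.
shared_names = df1[df1["name"].isin(df2["name"])]
print(shared_names["name"].tolist())  # ['John Deo', 'Max Ruin']
```

Note that `isin` tests membership only: a name counts as shared even when it belongs to a different id in the other table.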

References:

https://stackoverflow.com/questions/33945261/how-to-specify-multiple-return-types-using-type-hints

https://motamemo.com/python/pandas-tips/pandas-dataframe-change-value/
