[pandas] Dropping duplicates while filling in missing values

Posted at 2020-07-05

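Introduction

When deduplicating a pandas DataFrame on some key, you sometimes want the records judged to be the same record to fill in each other's missing values before the duplicates are dropped.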

import pandas as pd

df = pd.DataFrame({
    'building_name': ['Aビル', 'Aビル', None, 'Cビル', 'Bビル', None, 'Dビル'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})
df

building_name	property_scale	city_code
Aビル	large	1
Aビル	large	1
None	small	1
Cビル	small	2
Bビル	small	1
None	small	1
Dビル	large	1
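
For comparison, here is a minimal sketch of what plain `DataFrame.drop_duplicates` does with this data. It assumes the key is `['property_scale', 'city_code']`, the same key used when the function is run later in this article. Because the first row of each key is kept as-is, the `None` in the `small` / `1` group survives even though another row in that group has a building name.

```python
import pandas as pd

df = pd.DataFrame({
    'building_name': ['Aビル', 'Aビル', None, 'Cビル', 'Bビル', None, 'Dビル'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})

keys = ['property_scale', 'city_code']

# Plain deduplication: keeps the first row per key without any filling,
# so the missing building_name of the (small, 1) group is carried over.
plain = df.drop_duplicates(subset=keys)
print(plain)
```

This is the behavior that filling before deduplicating is meant to avoid.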

Fill-and-deduplicate function

import pandas as pd
from pandas import DataFrame

def drop_duplicates(df: DataFrame, subset: list, fillna: bool = False) -> DataFrame:
    """Drop duplicates keyed on subset, optionally filling missing values first.

    Args:
        df (DataFrame): Any DataFrame.
        subset (list): Key columns used to detect duplicates.
        fillna (bool): Whether duplicate records fill in each other's missing values. Defaults to False.
    
    Returns:
        DataFrame

    """
    if fillna:
        # Within each key group, back-fill then forward-fill, so the first row
        # of the group picks up any value present elsewhere in the group.
        df = pd.concat([
            group.bfill().ffill()
            for _, group
            in df.groupby(by=subset)])
    # Keep the first row of each key.
    return df.drop_duplicates(subset=subset)
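
One caveat worth checking with a grouping-based approach like this: by default `groupby` leaves out rows whose key contains a missing value, so those rows would not appear in the concatenated result. The sketch below uses a small hypothetical frame (not the article's data) and compares the default with `dropna=False`, which is available from pandas 1.1.

```python
import pandas as pd

# Hypothetical data: the last row has a missing key value.
sample = pd.DataFrame({
    'building_name': ['Aビル', None, 'Bビル'],
    'property_scale': ['large', 'large', None],
    'city_code': [1, 1, 1]
})

keys = ['property_scale', 'city_code']

# Number of rows that end up in some group.
rows_default = int(sample.groupby(keys).size().sum())
rows_keep_na = int(sample.groupby(keys, dropna=False).size().sum())

print(rows_default, rows_keep_na)
```

If rows with missing keys need to be kept, passing `dropna=False` to `groupby` is one option to consider.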

Running it

drop_duplicates(df, ['property_scale', 'city_code'], True)

building_name	property_scale	city_code
Aビル	large	1
Bビル	small	1
Cビル	small	2
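
As an alternative sketch, not the method used in this article, `GroupBy.first()` can give the same rows, since it takes the first non-null value of each column within each group. The column reordering with `list(df.columns)` is only there to match the original column order. Unlike the function above, the result gets a fresh 0-based index.

```python
import pandas as pd

df = pd.DataFrame({
    'building_name': ['Aビル', 'Aビル', None, 'Cビル', 'Bビル', None, 'Dビル'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})

keys = ['property_scale', 'city_code']

# first() skips missing values, so each group yields its first non-null
# building_name; as_index=False keeps the key columns as regular columns.
result = df.groupby(keys, as_index=False).first()[list(df.columns)]
print(result)
```

Note that `first()` picks the first non-null value per column independently, so with several non-key columns the values in one output row may come from different source rows.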
