Taking Kaggle Courses (Inconsistent Data Entry)

Posted at 2020-12-23

When the input data contains inconsistent spellings

Fixing each value by hand is the first approach that comes to mind, but there is a smarter way.

# List every distinct value in the column
countries = professors['Country'].unique()

# Sort alphabetically
countries.sort()
countries

Out

array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',
       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',
       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',
       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',
       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',
       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',
       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],
      dtype=object)

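This shows inconsistent spellings such as "Germany" and "germany", as well as entries with a leading space such as " Germany".

Next, convert everything to lower case and remove the white space at the start and end of each cell.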

# convert to lower case
professors['Country'] = professors['Country'].str.lower()
# remove leading and trailing white space
professors['Country'] = professors['Country'].str.strip()
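To see what these two steps achieve, here is a minimal plain-Python sketch. It does not use pandas or the real dataset; `raw_values` is a hand-picked sample of the values listed above, chosen only for illustration.

```python
# Illustrative sample only: a few of the values from the unique-value listing
raw_values = [' Germany', 'germany', ' USA', 'USA', ' New Zealand', 'New Zealand']

# Same two steps as the pandas code: lower-case, then strip surrounding white space
normalized = [value.lower().strip() for value in raw_values]

distinct_before = sorted(set(raw_values))
distinct_after = sorted(set(normalized))

print(len(distinct_before), "distinct values before:", distinct_before)
print(len(distinct_after), "distinct values after:", distinct_after)
```

Six spellings collapse into three, which is the same effect `.str.lower()` and `.str.strip()` have on the `Country` column.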

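Using fuzzy matching to fix inconsistent data entry

Fuzzy matching:
The process of automatically finding text strings that are very similar to a target string.
In general, the fewer characters you would need to change to turn one string into another, the "closer" the two strings are considered to be.
You cannot rely on fuzzy matching 100%, but it usually saves at least some time.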
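The idea of "number of characters that need to change" can be made concrete with the Levenshtein edit distance. The following is a minimal sketch for illustration; `edit_distance` is a hypothetical helper written here, not a function from fuzzywuzzy, and fuzzywuzzy's own scores are computed differently.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b (Levenshtein)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


print(edit_distance("south korea", "southkorea"))  # one deleted space
print(edit_distance("kitten", "sitting"))
```

A small distance suggests the two strings are probably variants of the same entry, while a large one suggests they are different values.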

The fuzzywuzzy module makes fuzzy matching easy.

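import fuzzywuzzy
from fuzzywuzzy import process

# Re-read the unique values after cleaning, then get the 10 closest to "south korea"
countries = professors['Country'].unique()
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)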

matches

Out

[('south korea', 100),
 ('southkorea', 48),
 ('saudi arabia', 43),
 ('norway', 35),
 ('austria', 33),
 ('ireland', 33),
 ('pakistan', 32),
 ('portugal', 32),
 ('scotland', 32),
 ('australia', 30)]
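It may look surprising that 'southkorea' scores only 48. The scorer sorts the words of each string before comparing them, so "south korea" is compared as "korea south". The sketch below approximates this with the standard library's `difflib`; `token_sort_score` is a hypothetical stand-in, not fuzzywuzzy's implementation, and fuzzywuzzy can return slightly different numbers depending on which backend it has installed.

```python
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Rough stand-in for a token-sort similarity score on a 0-100 scale."""
    def normalize(text):
        # Lower-case, split on white space, sort the words, and rejoin them
        return " ".join(sorted(text.lower().split()))

    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


for candidate in ["south korea", "southkorea", "saudi arabia"]:
    print(candidate, token_sort_score("south korea", candidate))
```

Because the word order changes, only one five-letter block of 'southkorea' lines up with "korea south", which is why the score is low even though the two strings differ by a single space. This is also why the threshold used later has to be set well below 100.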

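The top two entries both appear to mean "south korea", so we replace every value with a score of 47 or higher with "south korea".
Since this is something we will do many times, we wrap it in a function.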

def replace_matches_in_column(df, column, string_to_match, min_ratio=47):
    # Get every distinct value stored in the column
    strings = df[column].unique()

    # Get the 10 values closest to the input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # Keep only the matches whose score is at least min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # Find the rows of the DataFrame whose value is one of the close matches
    rows_with_matches = df[column].isin(close_matches)

    # Replace those values with the input string
    df.loc[rows_with_matches, column] = string_to_match

    print("All done!")

Calling the function

replace_matches_in_column(df=professors, column='Country', string_to_match="south korea")
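The choice of `min_ratio` decides which values get overwritten, so it is worth checking on a small sample first. The sketch below works on a plain list instead of a DataFrame and uses `difflib` in place of fuzzywuzzy; `token_sort_score` and `replace_close_values` are hypothetical helpers written for this illustration, and their scores only approximate fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Approximate token-sort similarity on a 0-100 scale (illustrative)."""
    def normalize(text):
        return " ".join(sorted(text.lower().split()))
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def replace_close_values(values, target, min_ratio=47):
    """Return a copy of values where every entry scoring at least
    min_ratio against target is replaced by target."""
    return [target if token_sort_score(target, value) >= min_ratio else value
            for value in values]


countries = ["south korea", "southkorea", "saudi arabia", "norway"]

# The default threshold of 47 catches 'southkorea' but leaves the others alone
cleaned = replace_close_values(countries, "south korea")

# A stricter threshold of 90 leaves 'southkorea' unchanged
strict = replace_close_values(countries, "south korea", min_ratio=90)

print(cleaned)
print(strict)
```

A threshold that is too low would start overwriting unrelated countries, so it is safer to inspect the match list before replacing anything.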

Exercise (implementation example)
