
Trying the ELO Competition: part2_EDA

Last updated at 2023-08-14, posted at 2023-08-08

What is this article?

This is a learning log for the ELO MERCHANT CATEGORY RECOMMENDATION competition.
It continues the exploratory data analysis (EDA).
▶ The previous article is here

Environment

Google Colab
VSCode

Article outline

Data overview
- 2. test data
- 3. historical data
Data preprocessing

Data overview

2. test data

The task is to predict the target for each card_id


Columns	Description
card_id	Unique card identifier
first_active_month	Month of first purchase
feature_1	Feature 1
feature_2	Feature 2
feature_3	Feature 3

Check for missing values and for duplicates with the training data

# Concatenate the train and test data to check for missing values and duplicates
all_df = pd.concat([train_df, test_df], sort=False).reset_index(drop=True)

all_df

first_active_month contains missing values, which need to be handled

all_df.isnull().sum()

There are no duplicates in card_id

all_df.duplicated(subset="card_id").sum()

3. historical data

Columns	Description
card_id	Card identifier
month_lag	month lag to reference date
purchase_date	Purchase date
authorized_flag	'Y' if approved, 'N' if denied
category_3	anonymized category
installments	number of installments of purchase
category_1	anonymized category
merchant_category_id	Merchant category identifier (anonymized)
subsector_id	Merchant category group identifier (anonymized)
merchant_id	Merchant identifier (anonymized)
purchase_amount	Normalized purchase amount
city_id	City identifier (anonymized)
state_id	State identifier (anonymized)
category_2	anonymized category

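Columns of interest:
・authorized_flag: credit card authorization? Does it mean the transaction went through?
・category_1~3: likely to be useful in some way

A quick look at the data suggests that the purchase history is probably unique per merchant_id × card_id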

# First 5 rows
historical_df.head()
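
The guess that rows are unique per merchant_id × card_id can be tested directly rather than judged by eye. The following is a minimal sketch on made-up toy data (`toy_hist` and its values are assumptions for illustration, not the competition data); on the real data the same two calls would be applied to historical_df.

```python
import pandas as pd

# Toy stand-in for historical_df (values are made up for illustration)
toy_hist = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_1", "C_2"],
    "merchant_id": ["M_a", "M_a", "M_b", "M_a"],
    "purchase_date": ["2017-06-01", "2017-07-01", "2017-06-15", "2017-06-01"],
})

key = ["card_id", "merchant_id"]

# Number of rows whose (card_id, merchant_id) pair already appeared in an earlier row
n_dup_pairs = toy_hist.duplicated(subset=key).sum()

# Largest number of rows that share a single pair
max_rows_per_pair = toy_hist.groupby(key).size().max()

print(n_dup_pairs, max_rows_per_pair)
```

If `n_dup_pairs` is 0, the pair is a unique key; any positive value means the same card bought from the same merchant more than once.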

There are 29,112,361 rows × 14 columns, so it is naturally heavy

# Check dtypes and row count
historical_df.info()

Check for missing values

# Check for missing values
historical_df.isnull().sum()

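I want to check whether every card_id in the train and test data is included.
The output equals the number of records in all_df (already confirmed to have no duplicates), so every card_id appears to be covered.

# Build a dict holding the card_ids of each df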
card_id = {"all_card_id":all_df["card_id"].unique(),
           "historical_card_id":historical_df["card_id"].unique()}

card_id

# Convert each array to a set and count the values present in both
len(set(card_id["all_card_id"])&set(card_id["historical_card_id"]))
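
As an alternative to the set intersection above, `Series.isin` gives the same count and also lists which card_ids are missing, if any. This is a small sketch on toy data; `all_ids` and `hist_ids` are hypothetical stand-ins for all_df["card_id"] and historical_df["card_id"].

```python
import pandas as pd

# Hypothetical stand-ins for all_df["card_id"] and historical_df["card_id"]
all_ids = pd.Series(["C_1", "C_2", "C_3", "C_4"], name="card_id")
hist_ids = pd.Series(["C_1", "C_1", "C_2", "C_3", "C_3"], name="card_id")

# True where a train/test card_id appears at least once in the history
in_hist = all_ids.isin(hist_ids)

n_covered = int(in_hist.sum())            # same number the set intersection returns
missing_ids = all_ids[~in_hist].tolist()  # card_ids with no history rows

print(n_covered, missing_ids)
```

When `missing_ids` is empty, the count equals the length of all_df, which matches the result described above.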

Compare the counts of Y and N in authorized_flag

# Count per authorized_flag
authorized_counts = historical_df["authorized_flag"].value_counts()
authorized_counts

# Draw a bar chart; authorized_counts.plot(kind='bar') also works
authorized_counts.plot.bar()
plt.xlabel("Authorized Flag")
plt.ylabel('Count')
plt.title('Count of each Authorized Flag')
plt.xticks(rotation=0)

About 8.6% of transactions were not authorized

# The bars are hard to compare, so use a pie chart
plt.pie(authorized_counts, labels=authorized_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Authorized Flag Ratio')
plt.show()
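
The share read off the pie chart can also be obtained as a number with `value_counts(normalize=True)`. The sketch below uses a made-up flag column (9 "Y" and 1 "N"), so the 10.0 it prints is a property of the toy data, not of the competition data.

```python
import pandas as pd

# Toy authorized_flag column: 9 approved, 1 denied (made-up values)
flags = pd.Series(["Y"] * 9 + ["N"], name="authorized_flag")

# normalize=True returns fractions instead of raw counts
ratios = flags.value_counts(normalize=True)

denied_pct = round(ratios["N"] * 100, 1)
print(denied_pct)
```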

Also check the ratio of each category

# Ratio of category_1
historical_cat1_counts = historical_df["category_1"].value_counts()

plt.pie(historical_cat1_counts, labels=historical_cat1_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('category_1 Ratio')
plt.show()

# Do the same for category_2 and category_3
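
Rather than copying the block for each column, a loop can cover all three. The sketch below computes the ratios as numbers instead of drawing pie charts, and it passes `dropna=False` so that missing values show up as their own slice; both choices are assumptions, and `toy_hist` is made-up data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three category columns (made-up values)
toy_hist = pd.DataFrame({
    "category_1": ["N", "N", "Y", "N"],
    "category_2": [1.0, np.nan, 1.0, 3.0],
    "category_3": ["A", "B", np.nan, "A"],
})

ratios = {}
for col in ["category_1", "category_2", "category_3"]:
    # dropna=False keeps NaN as a category so missing rows are not silently ignored
    ratios[col] = toy_hist[col].value_counts(normalize=True, dropna=False)
    print(ratios[col], end="\n\n")
```

Each entry of `ratios` can be passed to plt.pie in the same way as historical_cat1_counts.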

Next, look at installments

# Histogram of installments
plt.figure(figsize=(10,4))
plt.hist(historical_df["installments"], bins=50)
plt.title('Histogram of Installments')
plt.xlabel('Installments')
plt.ylabel('Frequency')

# The histogram tells us nothing, so look at the individual values
historical_df["installments"].value_counts()

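# The value 999 seems to be distorting the plot

Note: at this point Google Colab hits its RAM limit, so I switched the environment to VSCode. For how to do this, see ▶ this article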
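
Switching environments is what this article does. As a side note, another common way to reduce memory pressure is to shrink column dtypes. The following is only a sketch of that alternative, not something used in this article; `toy_hist` is made-up data and the choice of columns is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few historical_df columns (made-up values, 1000 rows)
toy_hist = pd.DataFrame({
    "installments": np.tile(np.array([0, 1, -1, 999], dtype="int64"), 250),
    "month_lag": np.tile(np.array([-13, -1, 0, -5], dtype="int64"), 250),
    "authorized_flag": ["Y", "Y", "N", "Y"] * 250,
})
before = toy_hist.memory_usage(deep=True).sum()

# Downcast integers to the smallest signed type that holds every value
for col in ["installments", "month_lag"]:
    toy_hist[col] = pd.to_numeric(toy_hist[col], downcast="integer")

# Store a low-cardinality string column as a categorical
toy_hist["authorized_flag"] = toy_hist["authorized_flag"].astype("category")

after = toy_hist.memory_usage(deep=True).sum()
print(toy_hist.dtypes)
print(before, after)
```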

# Remove 999, and also remove -1 because its meaning is unclear
historical_df = historical_df[~historical_df["installments"].isin([-1, 999])]
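
A quick way to confirm that the filter above drops only the intended rows is to count the mask before applying it. This sketch uses a made-up installments column (`toy_hist` is an assumption for illustration).

```python
import pandas as pd

# Toy installments column containing the two suspicious values (made-up data)
toy_hist = pd.DataFrame({"installments": [0, 1, 1, 2, -1, 999, 12]})

# True for rows that will be removed
mask = toy_hist["installments"].isin([-1, 999])
n_dropped = int(mask.sum())

filtered = toy_hist[~mask]
remaining_values = sorted(filtered["installments"].unique().tolist())

print(n_dropped, "rows dropped")
print(filtered["installments"].value_counts().sort_index())
```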

# Plotting the histogram again gives the following

Next, look at purchase_amount

# Check the minimum and maximum of purchase_amount
print(historical_df["purchase_amount"].min(),
      historical_df["purchase_amount"].max())

# A histogram would not show the distribution well, so use a box plot
plt.figure(figsize=(12,4))
plt.boxplot(historical_df["purchase_amount"], vert=False, whis=1.5)
plt.title("Box plot of purchase_amount")
plt.xlabel("purchase_amount")
plt.show()

# Remove outliers
historical_df = historical_df[historical_df["purchase_amount"]<=6010603]

print(historical_df["purchase_amount"].max())
# Looks good
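
The cutoff above is a hand-picked value. For reference, `whis=1.5` in the box plot means the whiskers reach at most 1.5 × IQR beyond the quartiles, and points outside that range are drawn as outliers. The sketch below computes those fences on made-up amounts; it illustrates the IQR rule, which is a different criterion from the fixed threshold used in this article.

```python
import pandas as pd

# Toy purchase_amount values with one extreme point (made-up data)
amounts = pd.Series([-0.7, -0.6, -0.5, -0.4, -0.3, 50.0], name="purchase_amount")

q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences that correspond to whis=1.5 in plt.boxplot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```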

Draw the box plot again

Next, look at month_lag

# Histogram of month_lag
plt.figure(figsize=(10,4))
plt.hist(historical_df["month_lag"], bins=50)
plt.xlabel("month_lag")
plt.ylabel("Frequency")
plt.show()

Data preprocessing

Preprocessing train_df

Remove rows whose target is -33 or below

train_df = train_df[train_df["target"]>-33]
print(train_df.shape)

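Preprocessing test_df

Fill first_active_month with the mode

# Compute the mode of first_active_month in test_df
mode_test_df = test_df["first_active_month"].mode()[0]
mode_test_df

# Fill with the mode (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
test_df["first_active_month"] = test_df["first_active_month"].fillna(mode_test_df)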

Preprocessing historical_df

Fill the missing values in category_2

# Fill with the mode for now
mode = historical_df["category_2"].mode()[0]

historical_df["category_2"] = historical_df["category_2"].fillna(mode)

Fill the missing values in merchant_id with "missing_id"

historical_df["merchant_id"] = historical_df["merchant_id"].fillna("missing_id")

historical_df.isnull().sum()
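
The two fills above can also be written as a single `DataFrame.fillna` call with a dict that maps each column to its fill value. This is a minimal sketch on made-up data (`toy_hist` and `fill_values` are hypothetical names), shown only as an equivalent formulation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns with missing values (made-up data)
toy_hist = pd.DataFrame({
    "category_2": [1.0, 1.0, np.nan, 3.0],
    "merchant_id": ["M_a", np.nan, "M_b", "M_a"],
})

# Column name -> value used to fill that column's missing entries
fill_values = {
    "category_2": toy_hist["category_2"].mode()[0],
    "merchant_id": "missing_id",
}
toy_hist = toy_hist.fillna(fill_values)

n_missing = int(toy_hist.isnull().sum().sum())
print(toy_hist)
print(n_missing)
```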

~ Continued in part 3 ~
