
[kaggle / python] The very basics of a regression problem (House Prices) (2)

Last updated at 2022-06-26, posted at 2021-06-06

This is a continuation of the previous article.
The previous article is here.

Summary of the previous article

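Last time, for kaggle's House Prices competition, I pulled off the sloppy job of
"deleting every column that cannot be passed to xgboost's model.fit as-is".

Even that sloppy job somehow landed in the top 68%, and that is where the article ended.
(I had expected the top 90% at best.)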

Today's result

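This time, I used LabelEncoder to
"convert all the columns deleted last time into a usable form, and then build the xgboost model".
That is as far as I got.

As a result, I moved up from the top 68% to the top 64%! (((o(*ﾟ▽ﾟ*)o)))
(The total number of players has grown, so my actual rank went down.)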

[Images: result of the previous attempt / result of this attempt]

Now, let's look at what I actually did～

Second step

1. source code

1-0. same as the previous code

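The code that is unchanged from last time is collected here in "1-0." (^ワ^*)
It does the following:

- load the modules
- load the data
- split the data into explanatory variables and the response variable
- split the data into training data and test data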

# import modules
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBClassifier, XGBRegressor

# load data
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

# split data into explanatory variables and response variable
tmp_train_x = train.drop('SalePrice', axis=1)
tmp_train_y = train['SalePrice']

# split data into training data and test data
x_train, x_test, y_train, y_test = \
    train_test_split(tmp_train_x, tmp_train_y, test_size=0.20, random_state=0)

1-1. replace the variables not used in the first step (with LabelEncoder)

In the previous article, I deleted from the training data every column that model.fit rejected.
These are the columns I deleted.

not_expected_type_column_names = train.columns[train.dtypes == 'object']
not_expected_type_column_names

# output
[
    'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
    'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
    'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
    'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir',
    'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
    'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence',
    'MiscFeature', 'SaleType', 'SaleCondition'
]

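This time, I will convert these columns so that all of them can be used as training data too.

Looking through my reference book for a method that needs no real thought, I found something called label encoding.
Let's try this one first. ¹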
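Before applying it to the real columns, here is a minimal sketch of what label encoding does. The values are made up in the style of the MSZoning column and are not taken from the dataset; the point is only that each distinct string is replaced by its index in the sorted list of labels.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values in the style of the MSZoning column (illustrative only)
zoning = ['RL', 'RM', 'FV', 'RL', 'RM']

le = LabelEncoder()
encoded = le.fit_transform(zoning)

# classes_ holds the distinct labels in sorted order;
# each label is replaced by its index in that array
print(le.classes_.tolist())  # ['FV', 'RL', 'RM']
print(encoded.tolist())      # [1, 2, 0, 1, 2]

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(encoded).tolist()
```

Note that the integers carry no meaning beyond identity: 'RM' being 2 and 'FV' being 0 does not say anything about their order or distance.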

Out of curiosity, I checked what kind of string data these columns contain. (First row only.)

print(x_train[not_expected_type_column_names].iloc[0].values)

['RL' 'Pave' nan 'Reg' 'Lvl' 'AllPub' 'Inside' 'Gtl' 'NridgHt' 'Norm'
 'Norm' '1Fam' '1Story' 'Hip' 'CompShg' 'CemntBd' 'CmentBd' 'BrkFace' 'Ex'
 'TA' 'PConc' 'Ex' 'TA' 'Av' 'GLQ' 'Unf' 'GasA' 'Ex' 'Y' 'SBrkr' 'Gd'
 'Typ' 'Gd' 'Attchd' 'Unf' 'TA' 'TA' 'Y' nan nan nan 'New' 'Partial']

Hmm, I had assumed every one of these columns held strings...
but there are nan values in there... (∩´﹏`∩)
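Looking at a single row only hints at the problem. A rough sketch of how the missing values could be counted per column follows; the small frame is a hypothetical stand-in for x_train, and select_dtypes is used here instead of the dtypes == 'object' comparison only so that the sketch does not depend on how pandas types string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for x_train
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'RL'],
    'Alley': [np.nan, 'Grvl', np.nan],
    'PoolQC': ['Ex', np.nan, np.nan],
    'LotArea': [11694, 6600, 13360],
})

# Pick the non-numeric columns and count the missing values in each
non_numeric = toy.select_dtypes(exclude='number')
nan_counts = non_numeric.isnull().sum()

# Show only the columns that actually contain missing values
print(nan_counts[nan_counts > 0])
```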

In fact, my first attempt was to run label encoding with the nan values still in the training data, and it raised the following error.

---------------------------------------------------------------------------
...(omitted)...


During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-97-dd5b0af638b6> in <module>
     11     target_all_data_column = df_all_data[col_name]
     12     le = LabelEncoder()
---> 13     le.fit(target_all_data_column)
     14     if col_name == 'MSZoning': print(target_all_data_column.unique())
     15 

...(omitted)...

/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    112             res = _encode_python(values, uniques, encode)
    113         except TypeError:
--> 114             raise TypeError("argument must be a string or number")
    115         return res
    116     else:

TypeError: argument must be a string or number

TypeError: argument must be a string or number is raised at the point where
data containing nan is fed to the fit function of the LabelEncoder() instance.

To avoid this, I replace nan with a string before applying the label encoder. ²

from sklearn.preprocessing import LabelEncoder

x_train_label_encoded = x_train.copy()
x_test_label_encoded = x_test.copy()

# ※ Caution
df_all_data = pd.concat([x_train, x_test, test])

for col_name in not_expected_type_column_names:
    # replace nan with the string 'NaN'
    target_all_data_column = df_all_data[col_name].fillna('NaN')
    le = LabelEncoder()
    le.fit(target_all_data_column)
    if col_name == 'MSZoning': print(target_all_data_column.unique())

    target_train_column = x_train[col_name].fillna('NaN')
    target_test_column = x_test[col_name].fillna('NaN')
    x_train_label_encoded[col_name] = le.transform(target_train_column)
    x_test_label_encoded[col_name] = le.transform(target_test_column)

x_train_label_encoded.head()

      Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  ...  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition
618  619          20         4         90.0    11694       1      1         3  ...            1        0       7    2007         7              5  
870  871          20         4         60.0     6600       1      1         3  ...            1        0       8    2009         9              4 
92    93          30         4         80.0    13360       1      0         0  ...            1        0       8    2009         9              4  
817  818          20         4          NaN    13265       1      1         0  ...            1        0       7    2008         9              4  
302  303          20         4        118.0    13704       1      1         0  ...            1        0       1    2006         9              4

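Memo ~Cautions when doing Label Encoding~

The line marked with the "※" comment in the code above gave me a bit of trouble.
The problem was that labels appearing in x_test or test do not necessarily appear in x_train.

Label encoding has to be able to encode every kind of label that exists.
If only the labels in x_train are fed to fit, any label that is absent from x_train but present in x_test or test is not recognized, and an error is raised.

The error looks like this.
ValueError: y contains previously unseen labels: 'RRAn'

So the label encoder's fit function had to be given all the labels, training data and test data included.
It is obvious once you think about it calmly, but with my limited data analysis experience I did not even notice this (_ _;)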
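The pitfall can be reproduced with a few lines. This is only a sketch with made-up label lists standing in for one categorical column: it first fits on the training labels alone, which fails on the unseen label, and then fits on the union of all labels, which works.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for one categorical column (values illustrative)
train_labels = ['Norm', 'Feedr', 'Norm']
test_labels = ['Norm', 'RRAn']  # 'RRAn' never appears in train_labels

# Fit on the training labels only: transform fails on the unseen label
le_train_only = LabelEncoder().fit(train_labels)
try:
    le_train_only.transform(test_labels)
    unseen_error = None
except ValueError as e:
    unseen_error = e
print(unseen_error)

# Fit on the union of all labels: every label now has a code
le_all = LabelEncoder().fit(train_labels + test_labels)
codes = le_all.transform(test_labels).tolist()
print(codes)  # [1, 2]
```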

1-2. create model and try prediction !

All that is left is to train XGBRegressor on the data that was label-encoded in 1-1.
Apart from the variable names, this is the same as last time.

# create model (training)
model = XGBRegressor(n_estimators=20, random_state=0)
model.fit(x_train_label_encoded, y_train)

# prediction (inference)
predict_result_for_tr_data = model.predict(x_train_label_encoded)
predict_result = model.predict(x_test_label_encoded)

Now, let's turn the prediction results into scores!

# result of predicting on the training data
rmse_of_train = np.sqrt(mean_squared_error(predict_result_for_tr_data, y_train))
rmse_of_train

8146.100714171564

# result of predicting on the test data
rmse_of_test = np.sqrt(mean_squared_error(predict_result, y_test))
rmse_of_test

33584.266159693645

Last time the values were 9283.22... and 37010.13..., so both errors (RMSE) got smaller, which is nice ^^*
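For reference, here is a small sketch of what the RMSE calculation amounts to, written out with numpy on made-up prices. The second value is the same calculation on log prices; to my understanding the competition's leaderboard scores RMSE between the logarithms of predicted and actual prices, which would explain why the scores in the series list at the end (around 0.15) are on a completely different scale from the RMSE values above. Treat that reading as my assumption rather than something verified in this article.

```python
import numpy as np

# Made-up prices, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# RMSE: square the errors, average them, take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same calculation on log prices (relative rather than absolute error)
rmse_log = np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

print(rmse)
print(rmse_log)
```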

1-3. make submission data

Finally, I create the data to submit to kaggle.

test_encoded = test.copy()

for col_name in not_expected_type_column_names:
    # here too, don't forget to feed all the data to the label encoder's fit!
    target_all_data_column = df_all_data[col_name].fillna('NaN')
    le = LabelEncoder()
    le.fit(target_all_data_column)

    target_test_column = test[col_name].fillna('NaN')
    test_encoded[col_name] = le.transform(target_test_column)

predict_submission = model.predict(test_encoded)
submission = pd.DataFrame(
    {'Id': test['Id'], 'SalePrice': predict_submission}
)
submission.to_csv('submission_new.csv', index=False)

This submission is the one that landed in the top 64%～(((o(*ﾟ▽ﾟ*)o)))
It is not much of a climb, but it feels good to get a visible result!
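Looking back, the encoders are fitted twice on the same data, once in 1-1 and once in 1-3. One possible tidy-up, sketched here on hypothetical toy frames, is to fit each column's encoder once, keep it in a dict, and reuse it for every frame. The helper name encode and the toy column values are my own and not part of the code above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for x_train and test
toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl'], 'Alley': [None, 'Grvl']})
toy_test = pd.DataFrame({'Street': ['Pave', 'Pave'], 'Alley': ['Pave', None]})
categorical_columns = ['Street', 'Alley']

all_data = pd.concat([toy_train, toy_test])

# Fit one encoder per column on all the data and keep it for reuse
encoders = {
    col: LabelEncoder().fit(all_data[col].fillna('NaN'))
    for col in categorical_columns
}


def encode(df, encoders):
    """Return a copy of df with each column in `encoders` label-encoded."""
    encoded = df.copy()
    for col, le in encoders.items():
        encoded[col] = le.transform(df[col].fillna('NaN'))
    return encoded


toy_train_encoded = encode(toy_train, encoders)
toy_test_encoded = encode(toy_test, encoders)
print(toy_test_encoded)
```

Because every frame goes through the same fitted encoders, the mapping from label to integer cannot drift between training and submission.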

Next step

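I plan to pick what to do next from articles along these lines. (Loosely.)

First, I want to see what happens if I create a column for the total area of the rooms.
After that, I want to see how the result changes when low-importance columns are deleted.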
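As a rough idea of what the first step might look like, the sketch below sums three area columns into a new one. The source column names exist in the House Prices data, but choosing these three, and the new column name TotalSF, are my assumptions for illustration; the rows are toy values.

```python
import pandas as pd

# Toy rows; the source column names come from the House Prices data
toy = pd.DataFrame({
    'TotalBsmtSF': [856, 1262],
    '1stFlrSF': [856, 1262],
    '2ndFlrSF': [854, 0],
})

# 'TotalSF' is a hypothetical name for the new total-area column
toy['TotalSF'] = toy['TotalBsmtSF'] + toy['1stFlrSF'] + toy['2ndFlrSF']
print(toy['TotalSF'].tolist())  # [2566, 2524]
```

For the second idea, a fitted XGBRegressor exposes a feature_importances_ attribute, which could be a starting point for deciding which columns count as low-importance.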

References

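統計WEB / 1-5. 説明変数と目的変数 (Explanatory variables and response variables)
I copied the English translations of 説明変数 and 目的変数 from this article.
- explanatory variable: 説明変数
- response variable: 目的変数

Knowing terms like these should make related English articles easier to read (｀･ω･´)ｷﾘｯ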

Reference book
Kaggleで勝つデータ分析の技術 (Data Analysis Techniques to Win Kaggle)

Series list

No	New Trial	Score	Link	Note
1	xgboost	0.15663	here
2	Label Encoding	0.15160	-	this article
3	Add and delete column	0.15140	here
4	Make integer into categorical	0.14987	here
5	One hot Encoding	0.14835	here
6	Hyper parameters tuning	0.13948	here
7	Logarithmic transformation	0.13347		article not yet written

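¹ I knew one hot encoding was an option, but it increases the number of columns and looked like more work, and I had heard that one hot encoding is unnecessary if you only use xgboost, so I passed on it this time.
² I decided to replace nan with a string after reading these two articles: (1) [Teratail] sklearn の Label Encoderでカテゴリカル変数の前処理をしたい, (2) [stackoverflow] LabelEncoder - TypeError: argument must be a string or number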
