
House Price Prediction in Practice

Posted at 2018-12-29, last updated at 2018-12-29

Data description

Training data: 1460 rows
Test data: 1459 rows

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

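Preprocessing

First, handle the missing values.
Columns with many missing values are dropped from the feature matrix, and the remaining missing values are filled with the column median.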

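import pandas as pd

y_train = train['SalePrice']
X_train = train.drop(['SalePrice'], axis=1)
X_test = test

# Stack train and test so both receive identical preprocessing
Xmat = pd.concat([X_train, X_test])
# Drop the columns with many missing values
Xmat = Xmat.drop(['LotFrontage','MasVnrArea','GarageYrBlt'], axis=1)
# Fill the remaining gaps with each numeric column's median
Xmat = Xmat.fillna(Xmat.median(numeric_only=True))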
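As a rough illustration of how "columns with many missing values" might be identified before dropping them, the sketch below computes the share of missing values per column on a toy frame. The toy values and the 0.4 threshold are assumptions made for this example; the article does not state which cutoff was used.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined feature matrix (hypothetical values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, np.nan, 60.0],
    "GrLivArea": [1710.0, 1262.0, np.nan, 1717.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Share of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Drop columns above an illustrative threshold (0.4 is an assumption).
to_drop = missing_ratio[missing_ratio > 0.4].index.tolist()
reduced = toy.drop(columns=to_drop)

# Fill the remaining numeric gaps with each column's median.
filled = reduced.fillna(reduced.median(numeric_only=True))
print(missing_ratio)
print(filled)
```

Looking at the ratios first makes the drop list a documented decision rather than a hard-coded one.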

Use LabelEncoder to convert object (string) columns into numeric labels.

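from sklearn.preprocessing import LabelEncoder

# Runs on the separate train/test frames, so apply it before they are combined into Xmat.
for i in range(train.shape[1]):
    # Encode only object (string) columns
    if train.iloc[:,i].dtypes == object:
        lbl = LabelEncoder()
        # Fit on train and test together so both share one label mapping
        lbl.fit(list(train.iloc[:,i].values) + list(test.iloc[:,i].values))
        train.iloc[:,i] = lbl.transform(list(train.iloc[:,i].values))
        test.iloc[:,i] = lbl.transform(list(test.iloc[:,i].values))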
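The same idea can also be sketched on a combined frame, addressing columns by name instead of by position. This is only an alternative sketch, not the code used for the results in this article: the toy frames are hypothetical, and filling missing categories with the string "Missing" is an assumption made so that the encoder sees strings only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for train and test.
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", np.nan],
                          "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl"],
                         "LotArea": [11622, 14267, 13830]})

# Combine first so that train and test share a single label mapping.
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Encode every non-numeric column.
for col in combined.select_dtypes(exclude="number").columns:
    encoder = LabelEncoder()
    # Give missing categories their own explicit label (an assumption).
    values = combined[col].fillna("Missing").astype(str)
    combined[col] = encoder.fit_transform(values)

# Split back into the original train and test parts.
n_train = len(toy_train)
enc_train = combined.iloc[:n_train]
enc_test = combined.iloc[n_train:]
print(combined)
```

LabelEncoder assigns integers in sorted label order, which imposes an arbitrary ordering on the categories. Tree-based models such as random forests usually tolerate this better than linear models do.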

Combine closely related features into a new feature to make the analysis more efficient.

Xmat["TotalSF"] = Xmat["TotalBsmtSF"] + Xmat["1stFlrSF"] + Xmat["2ndFlrSF"]

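Data visualization and possible data processing

Looking at the distribution of SalePrice shows that it is not normally distributed.

Applying a log transform brings it close to a normal distribution, as plotted below.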
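A minimal sketch of such a log transform is shown here. It uses the five summary values of SalePrice from the table above (min, quartiles, max) as a toy sample, and `np.log1p` as one common choice. The article does not say which log function was actually used, so treat that as an assumption.

```python
import numpy as np
import pandas as pd

# Toy sample: min, 25%, 50%, 75% and max of SalePrice from the summary above.
prices = pd.Series([34900, 129975, 163000, 214000, 755000], name="SalePrice")

# log1p computes log(1 + x); expm1 is its exact inverse.
log_prices = np.log1p(prices)

# Skewness before and after; values nearer 0 indicate a more symmetric shape.
skew_before = prices.skew()
skew_after = log_prices.skew()

# Predictions made on the log scale have to be mapped back to prices.
restored = np.expm1(log_prices)
print(skew_before, skew_after)
```

If the model is trained on the transformed target, its predictions need the inverse transform before they are compared with real prices.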

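Pair plot

Heatmap

Based on the analysis above, the variables that appear important are extracted. (Where variables are similar and likely to be multicollinear, the one with the larger correlation coefficient is kept.)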

・OverallQual: Overall material and finish quality
・YearBuilt: Original construction date
・YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
・TotalSF: Total square feet
・GrLivArea: Above grade (ground) living area square feet
・FullBath: Full bathrooms above grade
・GarageCars: Size of garage in car capacity
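The selection rule described above (rank features by correlation with the target, and keep only the stronger one of a highly inter-correlated pair) can be sketched roughly as follows. The synthetic data, the `Noise` column, and the 0.5 and 0.8 thresholds are all assumptions for illustration, not values taken from the article.

```python
import numpy as np
import pandas as pd

# Synthetic data: two related features plus one unrelated column.
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, size=n).astype(float)
area = 500 + 150 * qual + rng.normal(0, 100, size=n)
toy = pd.DataFrame({
    "OverallQual": qual,
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, size=n),
})
toy["SalePrice"] = 20000 * qual + 50 * area + rng.normal(0, 10000, size=n)

# Absolute correlation of each feature with the target, largest first.
corr = toy.corr()
target_corr = corr["SalePrice"].drop("SalePrice").abs().sort_values(ascending=False)

# Keep features above an illustrative threshold.
selected = target_corr[target_corr > 0.5].index.tolist()

# For a strongly inter-correlated pair, drop the one less correlated with the target.
if abs(corr.loc["OverallQual", "GrLivArea"]) > 0.8:
    weaker = target_corr[["OverallQual", "GrLivArea"]].idxmin()
    selected.remove(weaker)
print(selected)
```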

Each variable is plotted below so that outliers can be identified and removed.
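Once outliers have been spotted in a plot, removing them usually comes down to a boolean mask. The sketch below shows the pattern on toy data. The rows and both thresholds are hypothetical; in practice they would be read off the actual scatter plots.

```python
import pandas as pd

# Hypothetical rows: the last two have a huge living area but a low price.
toy = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2200, 4700, 5600],
    "SalePrice": [150000, 200000, 260000, 184000, 160000],
})

# Flag rows that break the overall trend (thresholds are illustrative).
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 300000)

# Keep everything else and rebuild a clean index.
cleaned = toy[~is_outlier].reset_index(drop=True)
print(cleaned)
```

Only training rows should be removed this way, since every test row still needs a prediction.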

Analyzing OverallQual, which looks especially related to the price, with a box plot shows a clear linear trend.

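Building the model

A random forest is used for the analysis. The scatter plots did not show linearity, so it seemed more appropriate than multiple regression.
It also has other advantages, such as good generalization and few parameters to tune.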

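from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Specify how the data is split (stratified).
# Note: StratifiedKFold is designed for class labels; with a continuous
# target such as a log-transformed price it raises an error, and KFold
# is the usual choice for regression.
kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Parameter grid to search
param_grid = {'n_estimators': [400, 500, 600],
              'max_depth':  [3, 4]}

# Create the grid-search model instance
forest_model = GridSearchCV(RandomForestRegressor(), param_grid, cv=kf)

# Train the random forest (X, y: the preprocessed training features and target)
forest_model.fit(X, y)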
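After fitting, the grid search exposes the winning parameters and the cross-validated score. The sketch below shows that on synthetic data, with two deliberate differences from the code above: it swaps in plain `KFold` (stratification needs class labels, which a regression target does not have) and uses a much smaller grid so it runs quickly. The data and grid values are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for the housing features.
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

# Plain K-fold split, the usual choice for a continuous target.
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Deliberately small grid so the example runs in seconds.
small_grid = {"n_estimators": [20, 40], "max_depth": [3, 4]}

search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid, cv=cv)
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated R^2.
best_params = search.best_params_
best_cv_r2 = search.best_score_
print(best_params, best_cv_r2)
```

With the default `refit=True`, `search.predict` then uses a model retrained on all the data with the best parameters.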

After training, look at the MSE and the coefficient of determination (R²).

        Train            Test
MSE     840155614.755    850460488.628
R²      0.872            0.850
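For reference, both metrics can be computed with scikit-learn as sketched below. The arrays are toy values rather than the article's predictions. The last line takes the square root of the test MSE reported above, which puts the typical error at roughly 29,000 in the units of SalePrice.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy true values and predictions (hypothetical numbers).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Mean squared error: the average of the squared differences.
mse = mean_squared_error(y_true, y_pred)

# R^2: 1 minus (residual sum of squares / total sum of squares).
r2 = r2_score(y_true, y_pred)

# RMSE is in the same units as the target, so it is easier to interpret.
rmse = np.sqrt(mse)

# Square root of the test MSE reported in the table above.
article_test_rmse = np.sqrt(850460488.628)
print(mse, r2, rmse, article_test_rmse)
```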

The residuals between the predictions and the actual values are also plotted.
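A residual plot is built from the differences between predictions and actual values. As a small sketch of how a bias in those residuals might be checked numerically, the example below fits a straight line through residuals against predictions. The numbers are hypothetical and were chosen so that expensive houses are under-predicted.

```python
import numpy as np

# Hypothetical actual values and predictions.
y_true = np.array([100.0, 150.0, 200.0, 300.0, 500.0])
y_pred = np.array([120.0, 160.0, 200.0, 280.0, 420.0])

# Residual = prediction minus actual value.
residuals = y_pred - y_true

# Fit a line through (prediction, residual); a slope far from 0 means
# the error changes systematically with the predicted value.
slope, intercept = np.polyfit(y_pred, residuals, 1)
print(residuals, slope)
```

A slope near zero, with residuals scattered evenly around zero, is what an unbiased model would be expected to show.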

Competition ranking

The result was a score of 0.17576 and a rank of 3261/4748.
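The article does not say how the score is computed. Assuming this is the Kaggle House Prices competition, where submissions are scored by the RMSE between the logarithm of the predicted price and the logarithm of the actual price, the metric can be sketched as below. The function name `rmsle` and the sample prices are made up for this example.

```python
import numpy as np

def rmsle(actual, predicted):
    """RMSE between log(predicted) and log(actual); assumed competition metric."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    log_diff = np.log(predicted) - np.log(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Perfect predictions give a score of 0.
perfect = rmsle([100000, 200000], [100000, 200000])

# Predicting every price 10% too high gives log(1.1), whatever the price level.
uniform_over = rmsle([100000, 200000], [110000, 220000])
print(perfect, uniform_over)
```

Under that assumption, a score of 0.17576 corresponds to a typical relative error of roughly 19%, since exp(0.17576) is about 1.19.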

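Results and outlook

Prediction accuracy stalled at about 85% (a test R² of 0.850). The residuals also show a consistent bias, which suggests that there were still features that should have been extracted.
This time, handling missing values and outliers ended up being the main work, leaving room for further processing that looks at the relationships between variables.
Going forward, I want to broaden what I can do in data preprocessing, climb the competition rankings, and also take on real-world problems.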

The code is available here.
