tl;dr
We work through Kaggle's Real Estate DataSet, following Predicting Real Estate Value With Random Forests - Data Every Day #086.
The code runs on Google Colaboratory.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.preprocessing as sp
from sklearn.model_selection import train_test_split
import sklearn.linear_model as slm
from sklearn.ensemble import RandomForestRegressor
Downloading the data
Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Initialize and authenticate the Kaggle API client. The credentials are stored in Google Drive (/content/drive/My Drive/Colab Notebooks/Kaggle) as kaggle.json.
import os
kaggle_path = "/content/drive/My Drive/Colab Notebooks/Kaggle"
os.environ['KAGGLE_CONFIG_DIR'] = kaggle_path
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
Download the dataset using the Kaggle API.
dataset_id = 'arslanali4343/real-estate-dataset'
dataset = api.dataset_list_files(dataset_id)
file_name = dataset.files[0].name
file_path = os.path.join(api.get_default_download_dir(), file_name)
file_path
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.10 / client 1.5.9)
'/content/data.csv'
api.dataset_download_file(dataset_id, file_name, force=True, quiet=False)
100%|██████████| 34.6k/34.6k [00:00<00:00, 5.74MB/s]
Downloading data.csv to /content
True
Loading the data
Read the downloaded CSV file with Pandas.
data = pd.read_csv(file_path)
data
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
506 | 0.98765 | 0.0 | 12.50 | 0 | 0.561 | 6.980 | 89.0 | 2.0980 | 3 | 320 | 23.0 | 396.00 | 12.00 | 12.0 |
507 | 0.23456 | 0.0 | 12.50 | 0 | 0.561 | 6.980 | 76.0 | 2.6540 | 3 | 320 | 23.0 | 343.00 | 25.00 | 32.0 |
508 | 0.44433 | 0.0 | 12.50 | 0 | 0.561 | 6.123 | 98.0 | 2.9870 | 3 | 320 | 23.0 | 343.00 | 21.00 | 54.0 |
509 | 0.77763 | 0.0 | 12.70 | 0 | 0.561 | 6.222 | 34.0 | 2.5430 | 3 | 329 | 23.0 | 343.00 | 76.00 | 67.0 |
510 | 0.65432 | 0.0 | 12.80 | 0 | 0.561 | 6.760 | 67.0 | 2.9870 | 3 | 345 | 23.0 | 321.00 | 45.00 | 24.0 |
511 rows × 14 columns
Preparation
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511 entries, 0 to 510
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 511 non-null float64
1 ZN 511 non-null float64
2 INDUS 511 non-null float64
3 CHAS 511 non-null int64
4 NOX 511 non-null float64
5 RM 506 non-null float64
6 AGE 511 non-null float64
7 DIS 511 non-null float64
8 RAD 511 non-null int64
9 TAX 511 non-null int64
10 PTRATIO 511 non-null float64
11 B 511 non-null float64
12 LSTAT 511 non-null float64
13 MEDV 511 non-null float64
dtypes: float64(11), int64(3)
memory usage: 56.0 KB
Handling missing values
data['RM'] = data['RM'].fillna(data['RM'].mean())
data.isna().sum().sum()
0
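As an aside, the same mean imputation can also be done with scikit-learn's SimpleImputer, which is convenient when the preprocessing needs to live inside a Pipeline. A minimal sketch on a toy column (not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny DataFrame with one missing value, mirroring the RM column above
df = pd.DataFrame({'RM': [6.5, np.nan, 7.0, 6.1]})

# strategy='mean' fills NaNs with the column mean,
# equivalent to fillna(df['RM'].mean()) but reusable in a Pipeline
imputer = SimpleImputer(strategy='mean')
df['RM'] = imputer.fit_transform(df[['RM']]).ravel()

print(df['RM'].isna().sum())  # 0
```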
Splitting into X and y
y = data['MEDV']
X = data.drop('MEDV', axis=1)
Scaling
scaler = sp.StandardScaler()
X = scaler.fit_transform(X)
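StandardScaler transforms each column to zero mean and unit variance. A quick sanity check on toy data (worth noting that tree ensembles themselves are scale-invariant, so this step mainly matters when also fitting scale-sensitive models such as linear regression):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# After standardization each column has zero mean and unit variance
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```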
Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
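Note that train_test_split shuffles the data randomly, so each run yields a different split. A small sketch with hypothetical toy arrays, showing how random_state pins the split for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# random_state fixes the shuffle so the split is identical across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0)

print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)
```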
Training
# Note: criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0
no_bootstrap_model = RandomForestRegressor(n_estimators=100, criterion='mse', bootstrap=False)
no_bootstrap_model.fit(X_train, y_train)
RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
bootstrap_model = RandomForestRegressor(n_estimators=100, criterion='mse', bootstrap=True)
bootstrap_model.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
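With bootstrap=True, each tree is trained on a bootstrap draw and leaves some samples out; those out-of-bag samples can act as a built-in validation set via oob_score=True. A sketch on synthetic data (make_regression is used here purely for illustration, not the notebook's dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0)

# oob_score=True evaluates each sample using only the trees
# that did not see it during their bootstrap draw
oob_model = RandomForestRegressor(n_estimators=100, bootstrap=True,
                                  oob_score=True, random_state=0)
oob_model.fit(X_demo, y_demo)

print(oob_model.oob_score_)  # out-of-bag R^2 estimate
```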
Results
print('R^2 without bootstrapping:', no_bootstrap_model.score(X_test, y_test))
print(' R^2 with bootstrapping:', bootstrap_model.score(X_test, y_test))
R^2 without bootstrapping: 0.46518356517728865
R^2 with bootstrapping: 0.759932629153558
The model trained with bootstrap sampling scored a substantially higher R². (Since random_state is not fixed here, the exact numbers will vary from run to run.)
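R² alone can be hard to interpret in the target's units, so it may help to also report MAE and RMSE. A small sketch with hypothetical predictions (not the notebook's actual outputs):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true MEDV values and model predictions, for illustration only
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.0, 33.0, 35.0])

mae = mean_absolute_error(y_true, y_pred)          # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as MEDV
r2 = r2_score(y_true, y_pred)

print(mae, rmse, r2)
```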