More than 3 years have passed since last update.

Data Every Day: Real Estate Dataset

tldr

This post follows along with "Predicting Real Estate Value With Random Forests - Data Every Day #086", using Kaggle's Real Estate DataSet.

The runtime environment is Google Colaboratory.

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.preprocessing as sp
from sklearn.model_selection import train_test_split
import sklearn.linear_model as slm
from sklearn.ensemble import RandomForestRegressor

Downloading the data

Mount Google Drive.

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Initialize the Kaggle API client and authenticate.
The credentials are stored in Google Drive (/content/drive/My Drive/Colab Notebooks/Kaggle) as kaggle.json.

import os
kaggle_path = "/content/drive/My Drive/Colab Notebooks/Kaggle"
os.environ['KAGGLE_CONFIG_DIR'] = kaggle_path

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate() 

Download the data using the Kaggle API.

dataset_id = 'arslanali4343/real-estate-dataset'
dataset = api.dataset_list_files(dataset_id)
file_name = dataset.files[0].name
file_path = os.path.join(api.get_default_download_dir(), file_name)
file_path
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.10 / client 1.5.9)
'/content/data.csv'
api.dataset_download_file(dataset_id, file_name, force=True, quiet=False)
100%|██████████| 34.6k/34.6k [00:00<00:00, 5.74MB/s]
Downloading data.csv to /content
True

Loading the data

Use pandas to read the downloaded CSV file.

data = pd.read_csv(file_path)
data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
506 0.98765 0.0 12.50 0 0.561 6.980 89.0 2.0980 3 320 23.0 396.00 12.00 12.0
507 0.23456 0.0 12.50 0 0.561 6.980 76.0 2.6540 3 320 23.0 343.00 25.00 32.0
508 0.44433 0.0 12.50 0 0.561 6.123 98.0 2.9870 3 320 23.0 343.00 21.00 54.0
509 0.77763 0.0 12.70 0 0.561 6.222 34.0 2.5430 3 329 23.0 343.00 76.00 67.0
510 0.65432 0.0 12.80 0 0.561 6.760 67.0 2.9870 3 345 23.0 321.00 45.00 24.0

511 rows × 14 columns

Preparation

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511 entries, 0 to 510
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     511 non-null    float64
 1   ZN       511 non-null    float64
 2   INDUS    511 non-null    float64
 3   CHAS     511 non-null    int64  
 4   NOX      511 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      511 non-null    float64
 7   DIS      511 non-null    float64
 8   RAD      511 non-null    int64  
 9   TAX      511 non-null    int64  
 10  PTRATIO  511 non-null    float64
 11  B        511 non-null    float64
 12  LSTAT    511 non-null    float64
 13  MEDV     511 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 56.0 KB

Handling missing values

data['RM'] = data['RM'].fillna(data['RM'].mean())
data.isna().sum().sum()
0
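
As a small illustration of the imputation step above, here is a minimal sketch on a made-up column. The `toy` frame and its values are hypothetical and only stand in for `data['RM']`; the point is that `Series.mean()` skips NaN by default, so the fill value is the mean of the observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for data['RM']; the values are made up.
toy = pd.DataFrame({"RM": [6.0, np.nan, 7.0, 5.0]})

# Count missing values per column before imputing.
missing_before = toy.isna().sum()

# mean() ignores NaN by default, so this is the mean of 6.0, 7.0 and 5.0.
fill_value = toy["RM"].mean()

# Replace the missing entry with that mean, as the article does for RM.
toy["RM"] = toy["RM"].fillna(fill_value)

# Total number of missing cells left in the frame.
missing_after = int(toy.isna().sum().sum())
print(missing_before["RM"], fill_value, missing_after)
```

Mean imputation keeps the column mean unchanged but shrinks its variance slightly, which is usually acceptable when only a handful of rows are affected, as with the 5 missing RM values here.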

Splitting into X and y

y = data['MEDV']
X = data.drop('MEDV', axis=1)

Scaling

scaler = sp.StandardScaler()
X = scaler.fit_transform(X)
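
One caveat about the step above: fitting the scaler on the full X means the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training portion only. The names `X_demo`, `y_demo`, and `scaler_demo` are hypothetical and not part of the original notebook. Tree-based models are largely insensitive to feature scaling, so the effect on a random forest is likely to be small.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real estate dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(size=100)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0
)

# Fit the scaler on the training rows only ...
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# ... and reuse the training mean/std to transform the test rows.
X_te_scaled = scaler_demo.transform(X_te)
```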

Splitting into training and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
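
The split above is unseeded, so every run produces a different partition and different scores. If reproducibility matters, passing `random_state` is one option. The sketch below uses made-up arrays (`X_small`, `y_small` are hypothetical names) to show that the same seed yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features.
X_small = np.arange(40).reshape(20, 2)
y_small = np.arange(20)

# Two calls with the same seed produce identical partitions.
split_a = train_test_split(X_small, y_small, train_size=0.7, random_state=42)
split_b = train_test_split(X_small, y_small, train_size=0.7, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
```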

Training

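# Note: criterion='mse' matches the older scikit-learn version used here;
# scikit-learn 1.0 renamed it to 'squared_error' and 1.2 removed 'mse'.
no_bootstrap_model = RandomForestRegressor(n_estimators=100, criterion='mse', bootstrap=False)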
no_bootstrap_model.fit(X_train, y_train)
RandomForestRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
bootstrap_model = RandomForestRegressor(n_estimators=100, criterion='mse', bootstrap=True)
bootstrap_model.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
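
To make the difference between the two settings concrete: with `bootstrap=False`, every tree is fit on the entire training set, and with the default of considering all features at each split the trees end up nearly identical. The following sketch on synthetic data (an assumption, not the article's dataset; all variable names are hypothetical) shows one consequence: the fully grown trees reproduce the training targets essentially exactly, while the bootstrapped forest does not.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with continuous features and noisy targets.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Every tree sees all 200 rows.
full_data_forest = RandomForestRegressor(
    n_estimators=20, bootstrap=False, random_state=0
).fit(X_toy, y_toy)

# Every tree sees a resample drawn with replacement.
resampled_forest = RandomForestRegressor(
    n_estimators=20, bootstrap=True, random_state=0
).fit(X_toy, y_toy)

# R^2 on the training data itself (not a measure of generalization).
train_r2_full = full_data_forest.score(X_toy, y_toy)
train_r2_resampled = resampled_forest.score(X_toy, y_toy)
print("train R^2, bootstrap=False:", train_r2_full)
print("train R^2, bootstrap=True: ", train_r2_resampled)
```

A forest that memorizes the training set this way has little diversity among its trees, so averaging them reduces variance much less than averaging trees grown on different resamples.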

Results

print('R^2 without bootstrapping:', no_bootstrap_model.score(X_test, y_test))
print('   R^2 with bootstrapping:', bootstrap_model.score(X_test, y_test))
R^2 without bootstrapping: 0.46518356517728865
   R^2 with bootstrapping: 0.759932629153558

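The model trained with bootstrap sampling scored a considerably higher R^2 on the test set (about 0.76, versus about 0.47 without it). Because neither the split nor the models are seeded, the exact numbers vary from run to run.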
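
Since a single train/test split can be noisy on roughly 500 rows, one way to check such a comparison is k-fold cross-validation. The sketch below is an illustration on synthetic data generated with `make_regression`, not a rerun of the article's experiment; `X_syn`, `y_syn`, and `results` are hypothetical names, and the tree count is reduced to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the real estate dataset).
X_syn, y_syn = make_regression(
    n_samples=200, n_features=8, noise=20.0, random_state=0
)

results = {}
for use_bootstrap in (False, True):
    model = RandomForestRegressor(
        n_estimators=30, bootstrap=use_bootstrap, random_state=0
    )
    # Five R^2 scores, one per held-out fold.
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2")
    results[use_bootstrap] = scores
    print(f"bootstrap={use_bootstrap}: "
          f"mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and spread across folds gives a more stable picture than one split, at the cost of fitting each model five times.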
