More than 1 year has passed since last update.

住宅価格の予測について、 TensorFlowの決定木アルゴリズムを使用した分析手法の概要について (House Prices Prediction using TensorFlow Decision Forests)

Last updated at 2024-10-21Posted at 2024-10-15

はじめに

この記事では、TensorFlow Decision Forestsを使用して、House Pricesデータセットでベースラインのランダムフォレストモデルをトレーニングする手順を説明していきます。
そもそも、TensorFlowとはAI開発で使用する機械学習のライブラリで、Googleが開発したものです。
TensorFlowの特徴しては、データの読み込み、前処理、出力処理などをテンソルを使用して計算している部分になります。

はじめに大まかに言えば、コードは次のようになります：

overview.py

import tensorflow_decision_forests as tfdf
import pandas as pd

# データセットの読み込み
dataset = pd.read_csv("project/dataset.csv")

# PandasデータフレームをTensorFlowデータセットに変換
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label")

# ランダムフォレストモデルの作成とトレーニング
model = tfdf.keras.RandomForestModel()
model.fit(tf_dataset)

# モデルの概要を出力
print(model.summary())

決定フォレストは、ランダムフォレストや勾配ブースティングツリーを含むツリーベースの類似のモデルになります。このモデルは、表形式データを扱う場合に最適で、ニューラルネットワークを試す前の強力なベースラインを提供することがよくあります。

ライブラリのインポート

早速、いつも通りライブラリをインポートしていきましょう。

python.py

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# データ可視化のための設定
%matplotlib inline

TensorFlowとTensorFlow Decision Forestsのバージョンを表示

print("TensorFlow v" + tf.version)
print("TensorFlow Decision Forests v" + tfdf.version)

データセットの読み込み

# データセットのパス
train_file_path = "../input/house-prices-advanced-regression-techniques/train.csv"
dataset_df = pd.read_csv(train_file_path)

# データセットの形状を表示
print("Full train dataset shape is {}".format(dataset_df.shape))

データは81列と1460行で構成されています。以下のコードで、データセットの最初の3エントリを表示して全体の概要を確認します。

# データセットの最初の3行を表示
dataset_df.head(3)
79の特徴量列があり、SalePriceというラベル列で住宅の販売価格を予測します。

Id列の削除
Id列はモデルのトレーニングには不要なので削除します。

python
コードをコピーする
dataset_df = dataset_df.drop('Id', axis=1)
dataset_df.head(3)
特徴量のデータ型を確認するために次のコードを使用します。

python
コードをコピーする
dataset_df.info()
住宅価格の分布
住宅価格がどのように分布しているかを確認します。

python
コードをコピーする
# SalePrice列の基本統計量を表示
print(dataset_df['SalePrice'].describe())

# 住宅価格の分布をプロット
plt.figure(figsize=(9, 8))
sns.distplot(dataset_df['SalePrice'], color='g', bins=100, hist_kws={'alpha': 0.4});
数値データの分布
数値型の特徴量の分布を確認します。まず、データセットのすべてのデータ型をリストし、数値型だけを選択します。

python
コードをコピーする
# データ型のリストを取得し、数値型を選択
list(set(dataset_df.dtypes.tolist()))
df_num = dataset_df.select_dtypes(include=['float64', 'int64'])
df_num.head()
次に、すべての数値型の特徴量の分布をプロットします。

python
コードをコピーする
# 数値型特徴量のヒストグラムを表示
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);
データセットの準備
このデータセットは、数値、カテゴリ、欠損特徴量の混合データで構成されています。TF-DFはこれらのすべての特徴量タイプをネイティブにサポートしており、前処理は必要ありません。これがツリーベースのモデルの利点であり、TensorFlowや機械学習に対する優れたエントリーポイントです。

python
コードをコピーする
# データセットをトレーニングセットとテストセットに分割
import numpy as np

def split_dataset(dataset, test_ratio=0.30):
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(valid_ds_pd)))
データセットのTensorFlow形式への変換
PandasデータフレームをTensorFlowデータセット形式に変換する必要があります。TensorFlowデータセットは、高速なデータ読み込みが可能なライブラリであり、GPUやTPUを使ってニューラルネットワークをトレーニングする際に役立ちます。

モデルの選択
TensorFlow Decision Forestsでは、次のツリーベースのモデルを選択できます：

RandomForestModel
GradientBoostedTreesModel
CartModel
DistributedGradientBoostedTreesModel
このコンペティションでは、まずランダムフォレストから始めます。ランダムフォレストは、最もよく知られている決定フォレストのトレーニングアルゴリズムです。

利用可能なすべてのモデルをリスト化

tfdf.keras.get_all_models()
ランダムフォレストモデルの作成

ランダムフォレストモデルの作成

rf = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
rf.compile(metrics=["mse"]) # 評価指標としてMSEを設定
モデルのトレーニング

モデルのトレーニング

rf.fit(x=train_ds)


## モデルの可視化
ツリーベースのモデルの利点の一つは、簡単に可視化できることです。デフォルトで使用される木の数は300本です。

モデルの可視化

tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)
Out of Bag (OOB) データおよび検証データセットでの評価


# OOBデータでの評価
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()

検証データセットでの評価

evaluation = rf.evaluate(x=valid_ds, return_dict=True)
for name, value in evaluation.items():
print(f"{name}: {value:.4f}")


## 提出
最後に、競技用のテストデータで予測を行います。

テストデータを読み込み

test_file_path = "../input/house-prices-advanced-regression-techniques/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop('Id')

テストデータをTensorFlowデータセットに変換

test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, task=tfdf.keras.Task.REGRESSION)

予測を実行

preds = rf.predict(test_ds)

提出用のデータフレーム作成

output = pd.DataFrame({'Id': ids, 'SalePrice': preds.squeeze()})
output.head()



このノートブックでは、TensorFlow Decision Forestsを使って住宅価格を予測するためのモデルを作成し、その性能を評価しました。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up