
A Beginner's Attempt at SIGNATE: Accommodation Price Prediction

Posted at 2021-09-29 (last updated at 2021-09-29)

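About this article

I am writing this as my graduation blog post for Aidemy.
The goal is to practice techniques I have only just learned, so for now I am setting aside whether the analysis methods are appropriate and how accurate they are.
I believe this article complies with SIGNATE's information disclosure policy, but I will take it down if there is any problem.

About the data

I use the data from SIGNATE's "[Practice problem] Accommodation price prediction for a vacation rental service". See the competition page for details of the features.

The data has missing values, and more than half of the columns are categorical.
I need to decide which features to use, how to fill the missing values, and how to encode the categorical data.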

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55583 entries, 0 to 55582
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   accommodates            55583 non-null  int64  
 1   amenities               55583 non-null  object 
 2   bathrooms               55436 non-null  float64
 3   bed_type                55583 non-null  object 
 4   bedrooms                55512 non-null  float64
 5   beds                    55487 non-null  float64
 6   cancellation_policy     55583 non-null  object 
 7   city                    55583 non-null  object 
 8   cleaning_fee            55583 non-null  object 
 9   description             55583 non-null  object 
 10  first_review            43675 non-null  object 
 11  host_has_profile_pic    55435 non-null  object 
 12  host_identity_verified  55435 non-null  object 
 13  host_response_rate      41879 non-null  object 
 14  host_since              55435 non-null  object 
 15  instant_bookable        55583 non-null  object 
 16  last_review             43703 non-null  object 
 17  latitude                55583 non-null  float64
 18  longitude               55583 non-null  float64
 19  name                    55583 non-null  object 
 20  neighbourhood           50423 non-null  object 
 21  number_of_reviews       55583 non-null  int64  
 22  property_type           55583 non-null  object 
 23  review_scores_rating    43027 non-null  float64
 24  room_type               55583 non-null  object 
 25  thumbnail_url           49438 non-null  object 
 26  zipcode                 54867 non-null  object 
 27  y                       55583 non-null  float64
dtypes: float64(7), int64(2), object(19)
memory usage: 12.3+ MB

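Choosing the features to use

Checking trends in the numerical data
First, I checked the trends in the numerical data, which is easy to inspect.
I found no strong correlations between explanatory variables that would cause multicollinearity, and no outliers that looked problematic (probably).

For the features that correlate weakly with the target ("y"), I decided on the following approach.

・Latitude/longitude ('latitude', 'longitude'): combine the two into a new feature
・Review information ('number_of_reviews', 'review_scores_rating'): likewise, combine the two into a new feature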

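import numpy as np
import pandas as pd
import seaborn as sns

# df holds the training data as a pandas DataFrame
temp = ['accommodates', 'bathrooms', 'bedrooms', 'beds', 'latitude',
       'longitude', 'number_of_reviews', 'review_scores_rating', 'y']

# Summary statistics
df.describe()

# Pairwise scatter plots
sns.pairplot(df[temp])

# Correlation matrix
df[temp].corr()

# Check for outliers
df[temp].boxplot()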
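
The correlation check above was done by reading the matrix by eye. As a minimal sketch of making it explicit, the snippet below lists every pair of columns whose absolute correlation exceeds a cut-off. The threshold of 0.8, the helper name `strong_pairs`, and the toy data are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (made up): "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
result = strong_pairs(toy)
print(result)
```

On the real data this would be called as `strong_pairs(df[temp].drop(columns="y"))`; an empty result would support the "no strong correlations" reading.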

Reference: scatter plots of the target "y" against the explanatory variables

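Checking trends in the categorical data
Based on the summary statistics of the categorical data, I decided on the following approach.

・One-hot encode 'host_identity_verified', 'room_type', and 'cancellation_policy', which have few unique values
・"amenities" packs many items inside {}, so split it up and try building features from the items that look meaningful

# Summary statistics of the categorical data
df.describe(include="O")

# Sample of the amenities data
{TV,"Wireless Internet",Kitchen,"Free parking on premises",Washer,Dryer,"Smoke detector"}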

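Creating new features

Combining latitude/longitude ('latitude', 'longitude')
Location is in fact already available as the categorical column "city", which splits the data into 6 classes. I assumed a finer split might improve accuracy, so I checked what happens when 'latitude' and 'longitude' are each split into 20 bins.
Even with 20 bins the data does not split very finely, so I go ahead and combine 'latitude' and 'longitude' as they are.

# Histograms to see the distribution with 20 bins
df[["latitude","longitude"]].hist(bins=20)

# Create columns from the 20-bin data
# So that the two stay distinguishable after combining, label "latitude" 1 to 20 and "longitude" 100 to 2000 in steps of 100
df["_lati"] = pd.cut(df["latitude"],20,labels=range(1,21))
df["_long"] = pd.cut(df["longitude"],20,labels=range(100,2001,100))
# Cross-tabulation
pd.crosstab(df["_lati"], df["_long"])

To combine the features, convert the labels to int and add them.
Then convert the dtype to "object" so the column can be one-hot encoded later.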

df["location"] = np.array([int(i) for i in df["_lati"].values]) + np.array([int(j) for j in df["_long"].values])
df = df.astype({'location': object})
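
To see why the 1 to 20 and 100 to 2000 label ranges keep the two bins distinguishable after addition, here is a small sketch on made-up coordinates (the five points are illustrative, not rows from the dataset). Because the latitude label is always below 100, the sum can be decoded back with `% 100`.

```python
import pandas as pd

# Toy coordinates, made up for illustration.
toy = pd.DataFrame({
    "latitude": [33.7, 34.0, 40.7, 42.3, 37.8],
    "longitude": [-118.3, -118.2, -74.0, -71.1, -122.4],
})

n_bins = 20
# Latitude bins are labelled 1..20, longitude bins 100..2000 in steps of 100.
lat_bin = pd.cut(toy["latitude"], n_bins, labels=range(1, n_bins + 1)).astype(int)
lon_bin = pd.cut(toy["longitude"], n_bins, labels=range(100, 100 * n_bins + 1, 100)).astype(int)

code = lat_bin + lon_bin
# Stored as strings so that get_dummies treats the code as a category, not a number.
toy["location"] = code.astype(str)

# The latitude bin can be recovered from the combined code.
recovered_lat = code % 100
print(toy)
```

Note that `pd.cut` derives the bin edges from the minimum and maximum of whatever data it is given, so the same code on a different dataset can map the same coordinate to a different label.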

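Combining the review scores
I treat the number of reviews ("number_of_reviews") as the reliability of the review score and take its logarithm.
I multiply the log of the number of reviews by the review score ("review_scores_rating") to create the feature.
Missing values are filled with 0.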

df["_rev_score"] = df["number_of_reviews"].map(lambda x: np.log(x)) * df["review_scores_rating"]
df["_rev_score"] = df["_rev_score"].fillna(0)
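
One thing to watch with this feature: `np.log(0)` is `-inf`, so a listing with zero reviews only ends up as 0 if its rating is also missing (`-inf * NaN` is `NaN`, which `fillna(0)` then replaces). A variant that avoids the infinity altogether is to use `np.log1p`, which computes log(1 + x). This is a swapped-in technique, not what the article ran, and the toy rows below are made up.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): the last row has zero reviews but a non-missing rating.
toy = pd.DataFrame({
    "number_of_reviews": [0, 1, 10, 0],
    "review_scores_rating": [np.nan, 100.0, 90.0, 80.0],
})

# log1p(0) == 0, so zero-review listings get weight 0 instead of -inf.
weight = np.log1p(toy["number_of_reviews"])
toy["_rev_score"] = (weight * toy["review_scores_rating"]).fillna(0)
print(toy)
```

With `log1p` a listing with a single review gets a small positive weight rather than 0, so the resulting values differ slightly from the `np.log` version.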

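Extracting features from amenities
Take the "amenities" values one at a time, extract the items from inside the {} of each string, and count them in a dictionary.
This gave 130 distinct items.
From those, I picked "Wireless Internet", "Air conditioning", "24-hour check-in", and "Pool" by intuition for now.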

import re
from collections import defaultdict

dic = defaultdict(int)

for _ in range(df["amenities"].shape[0]):
  keys = re.findall(r'\{(.*)\}', df.loc[_,"amenities"])[0].split(",")
  for key in keys:
    dic[key] += 1
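
The same counting can be sketched with `collections.Counter`, which also gives a ranked list via `most_common`. The helper `parse_amenities` is hypothetical, and unlike the loop above it strips the surrounding double quotes from each item. The first sample string is the one shown earlier; the other two are made up.

```python
from collections import Counter

def parse_amenities(raw):
    """Split a '{a,"b c",d}' style string into a list of item names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # Naive split: assumes item names themselves contain no commas.
    return [item.strip().strip('"') for item in inner.split(",")]

samples = [
    '{TV,"Wireless Internet",Kitchen,"Free parking on premises",Washer,Dryer,"Smoke detector"}',
    '{TV,Pool}',
    '{}',
]

counts = Counter()
for raw in samples:
    counts.update(parse_amenities(raw))

print(counts.most_common(3))
```

On the real data, `counts.update(...)` would run over `df["amenities"]` instead of `samples`.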

Create a column for each of the items selected above ("Wireless Internet", "Air conditioning", "24-hour check-in", "Pool") and one-hot encode them.

sele = ["\"Wireless Internet\"","\"Air conditioning\"","\"24-hour check-in\"","Pool"]

# Start with every flag set to 0
df["Wireless Internet"] = 0
df["Air conditioning"] = 0
df["24-hour check-in"] = 0
df["Pool"] = 0

row_num = 0

# Note: with this if/elif chain, each row gets at most one flag
# (the first amenity that matches, in the order below)
for _ in range(df["amenities"].shape[0]):
  keys = re.findall(r'\{(.*)\}', df.loc[_,"amenities"])[0].split(",")
  if "\"Wireless Internet\"" in keys:
    df.loc[row_num, "Wireless Internet"] = 1
  elif "\"Air conditioning\"" in keys:
    df.loc[row_num, "Air conditioning"] = 1
  elif "\"24-hour check-in\"" in keys:
    df.loc[row_num, "24-hour check-in"] = 1
  elif "Pool" in keys:
    df.loc[row_num, "Pool"] = 1
  row_num += 1
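
Because the loop above uses an `if`/`elif` chain, a listing that has both "Wireless Internet" and "Pool" is flagged only for "Wireless Internet". If the intent is one independent flag per amenity, a sketch along the following lines would set each column separately. This is an alternative, not the code that produced the results in this article; `to_set` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

# Toy rows, made up for illustration.
toy = pd.DataFrame({"amenities": [
    '{TV,"Wireless Internet",Pool}',
    '{"Air conditioning","24-hour check-in"}',
    '{}',
]})
selected = ["Wireless Internet", "Air conditioning", "24-hour check-in", "Pool"]

def to_set(raw):
    """Turn '{a,"b c"}' into {'a', 'b c'} (assumes no commas inside item names)."""
    inner = raw.strip("{}")
    return {item.strip('"') for item in inner.split(",")} if inner else set()

amenity_sets = toy["amenities"].map(to_set)
for name in selected:
    # Each amenity is tested on its own, so one row can carry several flags.
    toy[name] = amenity_sets.map(lambda s, name=name: int(name in s))

print(toy[selected])
```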

Imputing missing values

Some of the numerical features I selected have missing values, so I considered how to fill them.

 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   accommodates  55583 non-null  int64  
 1   bathrooms     55436 non-null  float64
 2   bedrooms      55512 non-null  float64
 3   beds          55487 non-null  float64

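From the summary statistics and correlation coefficients checked at the start, the features with missing values correlate strongly with "accommodates", so I grouped by "accommodates" to look at the trend.

temp = ['accommodates', 'bathrooms', 'bedrooms', 'beds']

# Check the means
df[temp].groupby("accommodates").mean()
# Check the medians
df[temp].groupby("accommodates").median()

Filling with the median looks fine in general, but for a few rows no median is available, so those rows are filled from "accommodates": half its value for bathrooms and bedrooms, and the value itself for beds.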

pd.options.display.max_rows = 300
temp = ['accommodates', 'bathrooms', 'bedrooms', 'beds']

# Build a DataFrame of only the rows that have a missing value
# (the group medians below are therefore computed within this subset)
test = df.loc[df[temp].isnull().any(axis=1),temp]

# Fill the missing bathrooms values
test['_bathrooms'] = test.groupby(["accommodates"])["bathrooms"].transform(lambda x: x.fillna(x.median()))
test.loc[test["_bathrooms"].isna(),"_bathrooms"] = test.loc[test["_bathrooms"].isna(), "accommodates"] / 2

# Fill the missing bedrooms values
test['_bedrooms'] = test.groupby(["accommodates"])["bedrooms"].transform(lambda x: x.fillna(x.median()))
test.loc[test["_bedrooms"].isna(),"_bedrooms"] = test.loc[test["_bedrooms"].isna(), "accommodates"] / 2

# Fill the missing beds values
test['_beds'] = test.groupby(["accommodates"])["beds"].transform(lambda x: x.fillna(x.median()))
test.loc[test["_beds"].isna(), "_beds"] = test.loc[test["_beds"].isna(), "accommodates"]

# Write the values computed above back into the original data
df.loc[df["bathrooms"].isna() ,"bathrooms"] = test['_bathrooms']
df.loc[df["bedrooms"].isna() ,"bedrooms"] = test['_bedrooms']
df.loc[df["beds"].isna() ,"beds"] = test['_beds']
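
The code above takes the group medians from the subset of rows that contain a missing value. A variant that takes the medians from the whole table, and falls back to half of "accommodates" only when a group has no observed value at all, could look like the sketch below. It is an alternative under that assumption, shown on made-up toy data for "bathrooms" only, and not the procedure behind the reported results.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the group with accommodates == 16 has no observed bathrooms.
toy = pd.DataFrame({
    "accommodates": [2, 2, 2, 4, 4, 16],
    "bathrooms":    [1.0, np.nan, 1.0, 2.0, np.nan, np.nan],
})

# Median of each accommodates group, computed on all rows and aligned to the original index.
group_median = toy.groupby("accommodates")["bathrooms"].transform("median")
# Fallback for groups whose median is NaN (no observed value in the group).
fallback = toy["accommodates"] / 2

toy["bathrooms"] = toy["bathrooms"].fillna(group_median).fillna(fallback)
print(toy)
```

The same two-step `fillna` would apply to "bedrooms" and "beds", with the fallback adjusted per column.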

Regression analysis

Preparing the data for analysis
Build the training data from the selected features and one-hot encode the categorical data.

temp = ['accommodates', 'bathrooms', 'bedrooms', 'beds', '_rev_score', 
        'location', 'host_identity_verified', 'room_type', 'cancellation_policy','y',
        "Wireless Internet", "Air conditioning", "24-hour check-in", "Pool"]
df_test = pd.get_dummies(df[temp])
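
By default `pd.get_dummies` creates one column per category, so the dummy columns of each categorical feature always sum to 1 and are perfectly collinear with the intercept of a linear model. One common option, shown here only as a sketch on made-up toy data and not used in this article, is `drop_first=True`, which drops the first category of each feature and makes it the baseline.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})

# One column per category (what the article uses).
full = pd.get_dummies(toy)
# One column fewer per categorical feature; "Entire home/apt" becomes the baseline.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```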

Analysis
Split the training data with KFold and fit the model on each fold.
LinearRegression is run without any hyperparameter tuning for now.
The score does not look too bad.

import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LinearRegression

# Feature columns: everything except the target "y"
r = list(df_test.columns)
r.remove("y")

X_train, X_test, y_train, y_test = train_test_split(df_test[r], df_test["y"], random_state=7)

cv = KFold(n_splits=10, random_state=42, shuffle=True)
acc_results = []

for trn_index, val_index in cv.split(X_train):
  X_trn, X_val = X_train.iloc[trn_index], X_train.iloc[val_index]
  y_trn, y_val = y_train.iloc[trn_index], y_train.iloc[val_index]

  model = LinearRegression()
  model.fit(X_trn, y_trn)
  pred = model.predict(X_val)
  # RMSE on the validation fold
  acc = np.sqrt(sum((y_val - pred)**2)/len(y_val))
  acc_results.append(acc)

# Mean RMSE over the 10 folds
np.mean(acc_results)

---Result---
129.30577249199695
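
The per-fold RMSE loop can also be expressed with scikit-learn's `cross_val_score` and the built-in `"neg_root_mean_squared_error"` scorer. The sketch below uses synthetic data made up for illustration (three features with known coefficients and noise of standard deviation 0.5), so its numbers say nothing about the accommodation data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
# scikit-learn scorers follow "higher is better", so RMSE is returned negated.
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print(rmse_per_fold.mean())
```

With the article's data, `X` and `y` would be `X_train` and `y_train`.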

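Checking the coefficients and intercept
Of the features I worked hard to build, all except the review score ("_rev_score") appear to have had a reasonable influence on the prediction.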

print(pd.DataFrame(model.coef_, index=r))
print(model.intercept_)

---Result---
accommodates                          19.018487
bathrooms                             68.747517
bedrooms                              36.672900
beds                                 -11.329242
_rev_score                            -0.115965
Wireless Internet                    -15.173891
Air conditioning                      -2.918114
24-hour check-in                     -11.302865
Pool                                 -52.899458
location_110                          76.621340
location_201                         -14.307127
location_202                          -4.965153
location_203                         -30.172658
location_204                          -3.022386
location_1419                        -27.533552
location_1420                        -30.521696
location_1813                         48.595823
location_1916                        -39.833978
location_1917                         14.953999
location_2020                         10.185389
host_identity_verified_f              -2.579610
host_identity_verified_t              -8.606028
room_type_Entire home/apt             60.211179
room_type_Private room               -14.672023
room_type_Shared room                -45.539156
cancellation_policy_flexible         -91.104716
cancellation_policy_moderate        -102.609632
cancellation_policy_strict           -93.470902
cancellation_policy_super_strict_30  -76.781232
cancellation_policy_super_strict_60  363.966482

84.5255962833438

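Prediction
I ran predict on the separately distributed test.csv and submitted the result.
The score was 170.8292272.
That is quite far from the result on the training data, so there seems to be room for improvement.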
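
The article does not show how test.csv was preprocessed. One thing that may matter when predicting on a separate file is that `pd.get_dummies` only creates columns for the categories present in the data it is given, so the test matrix can end up with different columns from the training matrix. A minimal sketch of aligning the two with `reindex`, on made-up toy frames:

```python
import pandas as pd

# Toy frames, made up for illustration: "Boston" appears only in test,
# "NYC" and "SF" appear only in train.
train = pd.DataFrame({"beds": [1, 2, 3], "city": ["NYC", "LA", "SF"]})
test = pd.DataFrame({"beds": [2, 1], "city": ["LA", "Boston"]})

X_train = pd.get_dummies(train)
# Keep exactly the training columns, in the same order:
# columns missing from test are filled with 0, unseen categories are dropped.
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

print(X_test)
```

The same care applies to the binned "location" feature, since bin edges computed on test.csv would not necessarily match those computed on the training data.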
