TL;DR
We work through Kaggle's Sales of summer clothes in E-commerce Wish dataset,
following Summer Clothing Sales Prediction - Data Every Day #026.
The execution environment is Google Colaboratory.
# About the data
The goal is to predict clothing sales. Each column holds additional information about a product, such as its size.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.preprocessing as sp
from sklearn.model_selection import train_test_split
import tensorflow as tf
Downloading the data
Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Initialize the Kaggle API client and authenticate.
The credentials are stored in Google Drive (/content/drive/My Drive/Colab Notebooks/Kaggle) as kaggle.json.
import os
kaggle_path = "/content/drive/My Drive/Colab Notebooks/Kaggle"
os.environ['KAGGLE_CONFIG_DIR'] = kaggle_path
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
Download the data with the Kaggle API.
dataset_id = 'jmmvutu/summer-products-and-sales-in-ecommerce-wish'
dataset = api.dataset_list_files(dataset_id)
file_name = dataset.files[0].name
file_path = os.path.join(api.get_default_download_dir(), file_name)
file_path
'/content/summer-products-with-rating-and-performance_2020-08.csv'
api.dataset_download_file(dataset_id, file_name, force=True, quiet=False)
100%|██████████| 351k/351k [00:00<00:00, 39.1MB/s]
Downloading summer-products-with-rating-and-performance_2020-08.csv.zip to /content
True
import zipfile
zip_path = '/content/' + file_name + '.zip'
with zipfile.ZipFile(zip_path) as existing_zip:
    # Unpack the downloaded archive next to it so pd.read_csv can find the CSV.
    existing_zip.extractall('/content')
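The unzip step above can be wrapped in a small helper. The following is a minimal sketch, assuming a hypothetical helper name `extract_zip` that is not part of the original notebook. It builds a throwaway archive in a temporary directory, so it runs outside Colab as well.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Build a tiny archive so the sketch is self-contained (toy data, not the real dataset).
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "demo.csv.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo.csv", "price,units_sold\n16.0,100\n")

extracted = extract_zip(demo_zip, work_dir)
print(extracted)
print((work_dir / "demo.csv").read_text())
```

Returning the member names makes it easy to confirm which file ended up on disk before passing its path to `pd.read_csv`.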
Loading the data
Load the downloaded CSV file with pandas.
data = pd.read_csv(file_path)
data
index | title | title_orig | price | retail_price | currency_buyer | units_sold | uses_ad_boosts | rating | rating_count | rating_five_count | rating_four_count | rating_three_count | rating_two_count | rating_one_count | badges_count | badge_local_product | badge_product_quality | badge_fast_shipping | tags | product_color | product_variation_size_id | product_variation_inventory | shipping_option_name | shipping_option_price | shipping_is_express | countries_shipped_to | inventory_total | has_urgency_banner | urgency_text | origin_country | merchant_title | merchant_name | merchant_info_subtitle | merchant_rating_count | merchant_rating | merchant_id | merchant_has_profile_picture | merchant_profile_picture | product_url | product_picture | product_id | theme | crawl_month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 Summer Vintage Flamingo Print Pajamas Se... | 2020 Summer Vintage Flamingo Print Pajamas Se... | 16.00 | 14 | EUR | 100 | 0 | 3.76 | 54 | 26.0 | 8.0 | 10.0 | 1.0 | 9.0 | 0 | 0 | 0 | 0 | Summer,Fashion,womenunderwearsuit,printedpajam... | white | M | 50 | Livraison standard | 4 | 0 | 34 | 50 | 1.0 | Quantité limitée ! | CN | zgrdejia | zgrdejia | (568 notes) | 568 | 4.128521 | 595097d6a26f6e070cb878d1 | 0 | NaN | https://www.wish.com/c/5e9ae51d43d6a96e303acdb0 | https://contestimg.wish.com/api/webimage/5e9ae... | 5e9ae51d43d6a96e303acdb0 | summer | 2020-08 |
1 | SSHOUSE Summer Casual Sleeveless Soirée Party ... | Women's Casual Summer Sleeveless Sexy Mini Dress | 8.00 | 22 | EUR | 20000 | 1 | 3.45 | 6135 | 2269.0 | 1027.0 | 1118.0 | 644.0 | 1077.0 | 0 | 0 | 0 | 0 | Mini,womens dresses,Summer,Patchwork,fashion d... | green | XS | 50 | Livraison standard | 2 | 0 | 41 | 50 | 1.0 | Quantité limitée ! | CN | SaraHouse | sarahouse | 83 % avis positifs (17,752 notes) | 17752 | 3.899673 | 56458aa03a698c35c9050988 | 0 | NaN | https://www.wish.com/c/58940d436a0d3d5da4e95a38 | https://contestimg.wish.com/api/webimage/58940... | 58940d436a0d3d5da4e95a38 | summer | 2020-08 |
2 | 2020 Nouvelle Arrivée Femmes Printemps et Été ... | 2020 New Arrival Women Spring and Summer Beach... | 8.00 | 43 | EUR | 100 | 0 | 3.57 | 14 | 5.0 | 4.0 | 2.0 | 0.0 | 3.0 | 0 | 0 | 0 | 0 | Summer,cardigan,women beachwear,chiffon,Sexy w... | leopardprint | XS | 1 | Livraison standard | 3 | 0 | 36 | 50 | 1.0 | Quantité limitée ! | CN | hxt520 | hxt520 | 86 % avis positifs (295 notes) | 295 | 3.989831 | 5d464a1ffdf7bc44ee933c65 | 0 | NaN | https://www.wish.com/c/5ea10e2c617580260d55310a | https://contestimg.wish.com/api/webimage/5ea10... | 5ea10e2c617580260d55310a | summer | 2020-08 |
3 | Hot Summer Cool T-shirt pour les femmes Mode T... | Hot Summer Cool T Shirt for Women Fashion Tops... | 8.00 | 8 | EUR | 5000 | 1 | 4.03 | 579 | 295.0 | 119.0 | 87.0 | 42.0 | 36.0 | 0 | 0 | 0 | 0 | Summer,Shorts,Cotton,Cotton T Shirt,Sleeve,pri... | black | M | 50 | Livraison standard | 2 | 0 | 41 | 50 | NaN | NaN | CN | allenfan | allenfan | (23,832 notes) | 23832 | 4.020435 | 58cfdefdacb37b556efdff7c | 0 | NaN | https://www.wish.com/c/5cedf17ad1d44c52c59e4aca | https://contestimg.wish.com/api/webimage/5cedf... | 5cedf17ad1d44c52c59e4aca | summer | 2020-08 |
4 | Femmes Shorts d'été à lacets taille élastique ... | Women Summer Shorts Lace Up Elastic Waistband ... | 2.72 | 3 | EUR | 100 | 1 | 3.10 | 20 | 6.0 | 4.0 | 2.0 | 2.0 | 6.0 | 0 | 0 | 0 | 0 | Summer,Plus Size,Lace,Casual pants,Bottom,pant... | yellow | S | 1 | Livraison standard | 1 | 0 | 35 | 50 | 1.0 | Quantité limitée ! | CN | youngpeopleshop | happyhorses | 85 % avis positifs (14,482 notes) | 14482 | 4.001588 | 5ab3b592c3911a095ad5dadb | 0 | NaN | https://www.wish.com/c/5ebf5819ebac372b070b0e70 | https://contestimg.wish.com/api/webimage/5ebf5... | 5ebf5819ebac372b070b0e70 | summer | 2020-08 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1568 | Nouvelle Mode Femmes Bohême Pissenlit Imprimer... | New Fashion Women Bohemia Dandelion Print Tee ... | 6.00 | 9 | EUR | 10000 | 1 | 4.08 | 1367 | 722.0 | 293.0 | 185.0 | 77.0 | 90.0 | 0 | 0 | 0 | 0 | bohemia,Plus Size,dandelionfloralprinted,short... | navyblue | S | 50 | Livraison standard | 2 | 0 | 41 | 50 | NaN | NaN | CN | cxuelin99126 | cxuelin99126 | 90 % avis positifs (5,316 notes) | 5316 | 4.224605 | 5b507899ab577736508a0782 | 0 | NaN | https://www.wish.com/c/5d5fadc99febd9356cbc52ee | https://contestimg.wish.com/api/webimage/5d5fa... | 5d5fadc99febd9356cbc52ee | summer | 2020-08 |
1569 | 10 couleurs femmes shorts d'été lacent ceintur... | 10 Color Women Summer Shorts Lace Up Elastic W... | 2.00 | 56 | EUR | 100 | 1 | 3.07 | 28 | 11.0 | 3.0 | 1.0 | 3.0 | 10.0 | 0 | 0 | 0 | 0 | Summer,Panties,Elastic,Lace,Casual pants,casua... | lightblue | S | 2 | Livraison standard | 1 | 0 | 26 | 50 | 1.0 | Quantité limitée ! | CN | sell best quality goods | sellbestqualitygoods | (4,435 notes) | 4435 | 3.696054 | 54d83b6b6b8a771e478558de | 0 | NaN | https://www.wish.com/c/5eccd22b4497b86fd48f16b4 | https://contestimg.wish.com/api/webimage/5eccd... | 5eccd22b4497b86fd48f16b4 | summer | 2020-08 |
1570 | Nouveautés Hommes Siwmwear Beach-Shorts Hommes... | New Men Siwmwear Beach-Shorts Men Summer Quick... | 5.00 | 19 | EUR | 100 | 0 | 3.71 | 59 | 24.0 | 15.0 | 8.0 | 3.0 | 9.0 | 0 | 0 | 0 | 0 | runningshort,Beach Shorts,beachpant,menbeachsh... | white | SIZE S | 15 | Livraison standard | 2 | 0 | 11 | 50 | NaN | NaN | CN | shixueying | shixueying | 86 % avis positifs (210 notes) | 210 | 3.961905 | 5b42da1bf64320209fc8da69 | 0 | NaN | https://www.wish.com/c/5e74be96034d613d42b52dfe | https://contestimg.wish.com/api/webimage/5e74b... | 5e74be96034d613d42b52dfe | summer | 2020-08 |
1571 | Mode femmes d'été sans manches robes col en V ... | Fashion Women Summer Sleeveless Dresses V Neck... | 13.00 | 11 | EUR | 100 | 0 | 2.50 | 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0 | 0 | Summer,fashion women,Fashion,Lace,Dresses,Dres... | white | Size S. | 36 | Livraison standard | 3 | 0 | 29 | 50 | NaN | NaN | CN | modai | modai | 77 % avis positifs (31 notes) | 31 | 3.774194 | 5d56b32c40defd78043d5af9 | 0 | NaN | https://www.wish.com/c/5eda07ab0e295c2097c36590 | https://contestimg.wish.com/api/webimage/5eda0... | 5eda07ab0e295c2097c36590 | summer | 2020-08 |
1572 | Pantalon de yoga pour femmes à la mode Slim Fi... | Fashion Women Yoga Pants Slim Fit Fitness Runn... | 7.00 | 6 | EUR | 100 | 1 | 4.07 | 14 | 8.0 | 3.0 | 1.0 | 0.0 | 2.0 | 0 | 0 | 0 | 0 | Summer,Leggings,slim,Yoga,pants,Slim Fit,Women... | red | S | 50 | Livraison standard | 2 | 0 | 41 | 50 | NaN | NaN | CN | AISHOPPINGMALL | aishoppingmall | 90 % avis positifs (7,023 notes) | 7023 | 4.235939 | 5a409cf87b584e7951b2e25f | 0 | NaN | https://www.wish.com/c/5e857321f53c3d2d8f25e7ed | https://contestimg.wish.com/api/webimage/5e857... | 5e857321f53c3d2d8f25e7ed | summer | 2020-08 |
1573 rows × 43 columns
Preparation
Dropping unneeded columns
columns_to_drop = [
    'title',
    'title_orig',
    'currency_buyer',
    'shipping_option_name',
    'urgency_text',
    'merchant_title',
    'merchant_name',
    'merchant_info_subtitle',
    'merchant_id',
    'merchant_profile_picture',
    'product_url',
    'product_picture',
    'product_id',
    'tags',
    'has_urgency_banner',
    'theme',
    'crawl_month',
    'origin_country',
]
data = data.drop(columns_to_drop, axis=1)
Encoding
Ordinal Features
data.isnull().sum()
price 0
retail_price 0
units_sold 0
uses_ad_boosts 0
rating 0
rating_count 0
rating_five_count 45
rating_four_count 45
rating_three_count 45
rating_two_count 45
rating_one_count 45
badges_count 0
badge_local_product 0
badge_product_quality 0
badge_fast_shipping 0
product_color 41
product_variation_size_id 14
product_variation_inventory 0
shipping_option_price 0
shipping_is_express 0
countries_shipped_to 0
inventory_total 0
merchant_rating_count 0
merchant_rating 0
merchant_has_profile_picture 0
dtype: int64
size_ordering = ['XXS', 'XS', 'S', 'M', 'L', 'XL', 'XXL']
def ordinal_encode(data, column, ordering):
    # Sizes that are not in `ordering` become None (NaN) and are imputed later.
    return data[column].apply(lambda x: ordering.index(x) if x in ordering else None)
data['product_variation_size_id'] = ordinal_encode(data, 'product_variation_size_id', size_ordering)
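To see what the ordinal encoding does to labels outside the ordering, here is a minimal sketch on toy values. It uses `Series.map` with a lookup dictionary instead of the notebook's `apply`, which is a swapped-in but equivalent approach; the variable names `size_rank`, `sizes`, and `encoded` are made up for this illustration.

```python
import pandas as pd

size_ordering = ['XXS', 'XS', 'S', 'M', 'L', 'XL', 'XXL']

# Map each known size to its rank; labels missing from the dict become NaN.
size_rank = {size: rank for rank, size in enumerate(size_ordering)}

# 'SIZE S' and 'Size S.' are spellings that appear in the dataset preview.
sizes = pd.Series(['M', 'XS', 'SIZE S', 'Size S.', 'S'])
encoded = sizes.map(size_rank)

print(encoded.tolist())
print(encoded.isna().sum(), "labels were not recognized")
```

Non-standard spellings such as `SIZE S` are not matched against `S`, so they come out as missing values and are filled with the column mean in the later imputation step.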
One-hot Features
def onehot_encode(data, column):
    # One indicator column per distinct value; the original column is dropped.
    dummies = pd.get_dummies(data[column])
    data = pd.concat([data, dummies], axis=1)
    data = data.drop(column, axis=1)
    return data
data = onehot_encode(data, 'product_color')
(data.dtypes == 'object').sum()
0
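The following sketch shows the same one-hot step on a toy frame (made-up rows, not the real dataset). It also illustrates why the missing values in `product_color` no longer show up afterwards: with its default settings, `pd.get_dummies` gives a row with a missing value zeros in every indicator column.

```python
import pandas as pd

# Toy frame with one missing color.
toy = pd.DataFrame({
    'price': [16.0, 8.0, 8.0],
    'product_color': ['white', 'green', None],
})

dummies = pd.get_dummies(toy['product_color'])
encoded = pd.concat([toy.drop('product_color', axis=1), dummies], axis=1)

print(encoded.columns.tolist())
# Number of active indicators per row; the row with the missing color has none.
print(dummies.sum(axis=1).tolist())
```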
Handling missing values
data.isnull().sum()
price 0
retail_price 0
units_sold 0
uses_ad_boosts 0
rating 0
..
wine 0
wine red 0
winered 0
winered & yellow 0
yellow 0
Length: 125, dtype: int64
null_columns = ['rating_five_count', 'rating_four_count', 'rating_three_count', 'rating_two_count', 'rating_one_count', 'product_variation_size_id']
for column in null_columns:
    # Fill each missing value with the mean of its column.
    data[column] = data[column].fillna(data[column].mean())
data.isnull().sum().sum()
0
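As a small worked example of the mean imputation above, the sketch below fills one missing entry in a toy column. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rating_five_count': [26.0, np.nan, 5.0, 295.0]})

# Series.mean() skips NaN by default, so the mean is taken over 26, 5 and 295.
column_mean = toy['rating_five_count'].mean()
toy['rating_five_count'] = toy['rating_five_count'].fillna(column_mean)

print(column_mean)
print(toy['rating_five_count'].tolist())
```

The filled value is the average of the observed entries only, so a few very large counts can pull it well above the typical value in the column.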
Scaling
y = data['units_sold']
X = data.drop(['units_sold'], axis=1)
scaler = sp.MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
y.unique()
array([ 100, 20000, 5000, 10, 50000, 1000, 10000, 100000,
50, 1, 7, 2, 3, 8, 6])
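Min-max scaling maps each feature column to the range [0, 1] via (x - min) / (max - min). The sketch below is a hand-rolled NumPy version for illustration, not scikit-learn's implementation; the function name `min_max_scale` and the toy matrix are assumptions.

```python
import numpy as np


def min_max_scale(X):
    """Scale each column of X to [0, 1] using its own minimum and maximum."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns to avoid dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range


# Toy matrix: columns stand in for price and rating_count.
toy = np.array([[16.0, 54.0],
                [8.0, 6135.0],
                [2.72, 20.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Scaling matters here because columns such as `rating_count` span thousands while the one-hot color columns are only 0 or 1.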
Encoding y
encoder = sp.LabelEncoder()
y = encoder.fit_transform(y)
y_mappings = {index: label for index, label in enumerate(encoder.classes_)}
y_mappings
{0: 1,
1: 2,
2: 3,
3: 6,
4: 7,
5: 8,
6: 10,
7: 50,
8: 100,
9: 1000,
10: 5000,
11: 10000,
12: 20000,
13: 50000,
14: 100000}
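The mapping above assigns class indices in ascending order of the original values. The same behavior can be reproduced with `np.unique`, as in this sketch; it is an illustration of the idea, not a claim about how `LabelEncoder` is implemented internally.

```python
import numpy as np

# The distinct units_sold values shown by y.unique() above.
units_sold = np.array([100, 20000, 5000, 10, 50000, 1000, 10000, 100000,
                       50, 1, 7, 2, 3, 8, 6])

# classes: sorted distinct values; codes: index of each sample's class.
classes, codes = np.unique(units_sold, return_inverse=True)
mapping = {index: int(label) for index, label in enumerate(classes)}

print(mapping)
# Decoding is a plain lookup into the sorted classes.
print(classes[codes][:3])
```

Because `units_sold` takes only 15 distinct values, the notebook treats the problem as 15-class classification rather than regression.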
Training
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
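The split above is random, so the results differ from run to run unless a seed is fixed. As a rough sketch of what an 80/20 shuffle split does, here is a NumPy version with a fixed seed; the helper name `shuffle_split` and its `seed` parameter are hypothetical and not part of the notebook.

```python
import numpy as np


def shuffle_split(n_samples, train_size=0.8, seed=0):
    """Return shuffled, non-overlapping train and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(n_samples * train_size)
    return order[:n_train], order[n_train:]


# 1573 is the number of rows in the dataset.
train_idx, test_idx = shuffle_split(1573)
print(len(train_idx), len(test_idx))
```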
model = tf.keras.Sequential([
    # 124 input features after one-hot encoding; 15 output classes for units_sold.
    tf.keras.layers.Dense(16, activation='relu', input_shape=(124,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(15, activation='softmax'),
])
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_6 (Dense) (None, 16) 2000
_________________________________________________________________
dense_7 (Dense) (None, 16) 272
_________________________________________________________________
dense_8 (Dense) (None, 15) 255
=================================================================
Total params: 2,527
Trainable params: 2,527
Non-trainable params: 0
_________________________________________________________________
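The Param # column in the summary above can be checked by hand: a Dense layer has inputs × units weights plus one bias per unit. The helper name `dense_params` is made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the width of each Dense layer.
layer_sizes = [124, 16, 16, 15]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [2000, 272, 255]
print(sum(counts))  # 2527
```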
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
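The `sparse_categorical_crossentropy` loss takes integer class labels (as produced by the label encoding) and averages the negative log of the probability the model assigns to the true class. Below is a minimal NumPy sketch of that formula; it is not Keras's implementation, and the `eps` clipping value is an assumption added for numerical safety.

```python
import numpy as np


def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean of -log(p[true class]) over the batch."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    # Pick, for each row, the probability assigned to its true class.
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())


# Toy softmax outputs for two samples and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])

loss = sparse_categorical_crossentropy(y_true, probs)
print(loss)
```

Using the sparse variant avoids converting `y` into a 15-column one-hot matrix first.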
batch_size = 32
epochs = 300
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    verbose=0,
)
plt.figure(figsize=(14, 10))
epochs_range = range(1, epochs + 1)
train_loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(epochs_range, train_loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
Past a certain epoch the validation loss starts to rise, which shows that the model is overfitting.
We look for the epoch with the lowest validation loss, just before overfitting sets in.
np.argmin(val_loss)
94
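Note that `np.argmin` returns a zero-based index, while the plot's x-axis (`epochs_range`) starts at 1, so an index of 94 corresponds to epoch 95. The sketch below shows the conversion on a made-up loss curve; the values are illustrative only.

```python
import numpy as np

# Toy validation losses for epochs 1..6 (made-up values).
val_loss = [1.9, 1.4, 1.1, 0.9, 1.0, 1.3]

best_index = int(np.argmin(val_loss))  # zero-based position in the list
best_epoch = best_index + 1            # epoch number as shown on the plot

print(best_index, best_epoch)
```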