Introduction
I'm working on a regression task with a DNN.
When I threw every feature in as-is, I ranked about 10th from the bottom (´;ω;`)ウゥゥ
Kaggle really does feel like a preprocessing competition, doesn't it~
In fact, the moral is that in deep learning the preprocessing matters far more than the model.
Master preprocessing and you master deep learning, as they say. Alright, let's get started.
The code published on Kaggle is available here.
Import
Import the required libraries and specify the device.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import TensorDataset
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
# Validation
from torchvision import datasets
from torch.utils.data.dataset import Subset
import csv
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sn
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
# For saving models
import pickle
# Set the plot style
plt.style.use('seaborn-darkgrid')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
Read Data
TRAIN_DIR = "/kaggle/input/playground-series-s3e5/train.csv"
TEST_DIR = "/kaggle/input/playground-series-s3e5/test.csv"
TRAIN_DATA = pd.read_csv(TRAIN_DIR)
TEST_DATA = pd.read_csv(TEST_DIR)
print(f"{TRAIN_DATA .info()}\n")
print(f"{TEST_DATA .info()}\n")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2056 entries, 0 to 2055
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Id                    2056 non-null   int64
 1   fixed acidity         2056 non-null   float64
 2   volatile acidity      2056 non-null   float64
 3   citric acid           2056 non-null   float64
 4   residual sugar        2056 non-null   float64
 5   chlorides             2056 non-null   float64
 6   free sulfur dioxide   2056 non-null   float64
 7   total sulfur dioxide  2056 non-null   float64
 8   density               2056 non-null   float64
 9   pH                    2056 non-null   float64
 10  sulphates             2056 non-null   float64
 11  alcohol               2056 non-null   float64
 12  quality               2056 non-null   int64
dtypes: float64(11), int64(2)
memory usage: 208.9 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Id                    1372 non-null   int64
 1   fixed acidity         1372 non-null   float64
 2   volatile acidity      1372 non-null   float64
 3   citric acid           1372 non-null   float64
 4   residual sugar        1372 non-null   float64
 5   chlorides             1372 non-null   float64
 6   free sulfur dioxide   1372 non-null   float64
 7   total sulfur dioxide  1372 non-null   float64
 8   density               1372 non-null   float64
 9   pH                    1372 non-null   float64
 10  sulphates             1372 non-null   float64
 11  alcohol               1372 non-null   float64
dtypes: float64(11), int64(1)
memory usage: 128.8 KB
I see no missing values in this data.
Data Analysis
TRAIN_DATA.corr()
The score from just throwing every feature into the DNN was 0.042. That's the top 100%!!!
...in other words, dead last 🥺
%20230207----------------------------------------------%
The correlations in the data can be checked here.
It may be a good idea to select only the features with high correlations.
In this case, we use all of them.
=> This model's score is 0.04243 (fucking bad 👎)
%------------------------------------------------------%
%20230207----------------------------------------------%
Ranked by absolute correlation with quality:
alcohol(0.48) > sulphates(0.36) > total sulfur dioxide(0.225) > volatile acidity(0.219) > density(0.15) > citric acid(0.14) > fixed acidity(0.07) > free sulfur dioxide(0.06) > residual sugar(0.048) > chlorides(0.046) > pH(0.016)
I use the features from "alcohol" down to "citric acid".
%------------------------------------------------------%
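By the way, a ranking like the one above can be pulled out in one line (my own snippet, not from the original notebook):
# Absolute correlation of every column with quality, strongest first
print(TRAIN_DATA.corr()['quality'].abs().sort_values(ascending=False))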
Create New Features
So, let's create new features by combining existing ones and pick the ones that correlate strongly with the value we want to predict (quality).
Here, new features are made by adding features together or dividing one by another. You think up combinations yourself, guessing which ones might turn into good features.
Building and testing these exhaustively sounds like a slog, though...
Is there a clever way to search for them, like GridSearch???
Tell me if you know!!!! 🙇
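One brute-force idea I can offer is the sketch below. This is my own code, not part of the original notebook: it enumerates every pairwise product and ratio of the base columns and ranks them by absolute correlation with quality. It only screens the linear correlation of pairs, so it is a cheap filter rather than a true feature search, but it automates the drudgery.
import itertools

def search_pair_features(df, target="quality", top_k=10):
    # Combine every pair of columns except the target and the Id
    base_cols = [c for c in df.columns if c not in (target, "Id")]
    candidates = {}
    for a, b in itertools.combinations(base_cols, 2):
        candidates[f"{a}*{b}"] = df[a] * df[b]
        candidates[f"{a}/{b}"] = df[a] / df[b]
        candidates[f"{b}/{a}"] = df[b] / df[a]
    # Drop divide-by-zero artifacts so the correlations stay defined
    cand_df = pd.DataFrame(candidates).replace([np.inf, -np.inf], np.nan)
    return cand_df.corrwith(df[target]).abs().sort_values(ascending=False).head(top_k)

print(search_pair_features(TRAIN_DATA))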
# reference: https://www.kaggle.com/code/kdmitrie/pgs35-ensemble-of-pytorch-models
# Apply the same feature engineering to train and test
for df in (TRAIN_DATA, TEST_DATA):
    df["total_acid"] = df["fixed acidity"] + df["volatile acidity"] + df["citric acid"]
    df["acid/density"] = df["total_acid"] / df["density"]
    df["alcohol_density"] = df["alcohol"] * df["density"]
    df["sulphate/density"] = df["sulphates"] / df["density"]
    df["sulphates/acid"] = df["sulphates"] / df["volatile acidity"]
    df["sulphates/chlorides"] = df["sulphates"] / df["chlorides"]
    # note: the column name says "*", but the computation is a ratio; kept as-is for reproducibility
    df["sulphates*alcohol"] = df["sulphates"] / df["alcohol"]
TRAIN_DATA.corr()
I adopted the top eight features by absolute correlation.
%20230207----------------------------------------------%
Ranked by absolute correlation with quality:
alcohol(0.48) > alcohol_density(0.48) > sulphate/density(0.368) > sulphates(0.36) > sulphates/acid(0.326) > sulphates/chlorides(0.257) > total sulfur dioxide(0.225) > volatile acidity(0.219) >
density(0.15) > sulphates*alcohol(0.147) > citric acid(0.14) > fixed acidity(0.07) > acid/density(0.065) > total_acid(0.064) > free sulfur dioxide(0.06) > residual sugar(0.048) > chlorides(0.046) > pH(0.016)
I use the features from "alcohol" down to "volatile acidity".
%------------------------------------------------------%
Data Transform for the DNN
FEATURE_TRAIN_DATA = TRAIN_DATA[["volatile acidity","total sulfur dioxide","sulphates","alcohol",'alcohol_density','sulphate/density','sulphates/acid','sulphates/chlorides']].values
GROUND_TRUTH = TRAIN_DATA["quality"].values
FEATURE_TEST_DATA = TEST_DATA[["volatile acidity","total sulfur dioxide","sulphates","alcohol",'alcohol_density','sulphate/density','sulphates/acid','sulphates/chlorides']].values
# Normalization
from sklearn import preprocessing
mm = preprocessing.MinMaxScaler()
FEATURE_TRAIN_DATA = mm.fit_transform(FEATURE_TRAIN_DATA)
# use the scaler fitted on train for the test data, so both share the same scale
FEATURE_TEST_DATA = mm.transform(FEATURE_TEST_DATA)
print(f"TRAIN_DATA_SHAPE : {FEATURE_TRAIN_DATA.shape}")
print(f"GROUND_TRUTH_SHAPE : {GROUND_TRUTH.shape}")
print(f"TEST_DATA_SHAPE : {FEATURE_TEST_DATA.shape}")
Model
I've explained this part in my other Kaggle articles, so I'll skip the details here.
X_tensor = torch.Tensor(FEATURE_TRAIN_DATA)
# shape (N, 1) so the targets match the model output in the loss (avoids silent broadcasting)
y_tensor = torch.Tensor(GROUND_TRUTH).unsqueeze(1)
dataset = TensorDataset(X_tensor, y_tensor)
X_tensor = torch.Tensor(FEATURE_TEST_DATA)
test_dataset = TensorDataset(X_tensor)
batch_size = 4
def train_epoch(model, optimizer, criterion, dataloader, device):
    train_loss = 0
    model.train()
    for features, labels in dataloader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(features)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    # average of the per-batch losses
    train_loss = train_loss / len(dataloader)
    return train_loss
def validation(model, criterion, dataloader, device):
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for features, labels in dataloader:
            features, labels = features.to(device), labels.to(device)
            outputs = model(features)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
    val_loss = val_loss / len(dataloader)
    return val_loss
class EarlyStopping:
    def __init__(self, patience=30, verbose=False, path='checkpoint_model.pth'):
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.val_loss_min = np.inf
        self.path = path

    def __call__(self, val_loss, model):
        score = -val_loss
        if self.best_score is None:
            self.best_score = score
            self.checkpoint(val_loss, model)
        elif score < self.best_score:
            self.counter += 1
            if self.verbose:
                print(f'EarlyStopping counter: {self.counter} out of {self.patience}')
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.checkpoint(val_loss, model)
            self.counter = 0

    def checkpoint(self, val_loss, model):
        # save the model whenever the validation loss improves
        if self.verbose:
            print(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}). Saving model ...')
        torch.save(model.state_dict(), self.path)
        self.val_loss_min = val_loss
def run(num_epochs, optimizer, criterion, device, train_loader, val_loader, model):
    train_loss_list = []
    val_loss_list = []
    earlystopping = EarlyStopping(verbose=True)
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, optimizer, criterion, train_loader, device)
        val_loss = validation(model, criterion, val_loader, device)
        print(f'Epoch [{epoch+1}], train_loss : {train_loss:.4f}, val_loss : {val_loss:.4f}')
        train_loss_list.append(train_loss)
        val_loss_list.append(val_loss)
        earlystopping(val_loss_list[-1], model)
        if earlystopping.early_stop:
            print("Early Stopping!")
            break
    return train_loss_list, val_loss_list
def graph(train_loss_list, val_loss_list):
    num_epochs = len(train_loss_list)
    fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
    ax.plot(range(num_epochs), train_loss_list, c='b', label='train loss')
    ax.plot(range(num_epochs), val_loss_list, c='r', label='val loss')
    ax.set_xlabel('epoch', fontsize='10')
    ax.set_ylabel('loss', fontsize='10')
    ax.set_title('training and validation loss', fontsize='10')
    ax.grid()
    ax.legend(fontsize='10')
    plt.show()
## CV SETUP
# StratifiedKFold is usable here because quality only takes integer values
cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
The model this time is a simple DNN, with ReLU as the activation function.
class DNN(nn.Module):
    def __init__(self):
        super(DNN, self).__init__()
        self.fc1 = nn.Linear(8, 64)  # 8 input features
        self.fc2 = nn.Linear(64, 8)
        self.fc3 = nn.Linear(8, 1)   # single regression output

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
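A quick sanity check I like to add (my own lines, not in the original notebook): a batch of two 8-feature rows should come back as two scalar predictions.
dummy = torch.randn(2, 8)
print(DNN()(dummy).shape)  # torch.Size([2, 1])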
MAE is used for the loss.
I thought PyTorch didn't implement an MAELoss!! But it turns out nn.L1Loss is exactly that.
class MAELoss(nn.Module):
    def __init__(self):
        super(MAELoss, self).__init__()

    def forward(self, outputs, targets):
        # mean absolute error; outputs and targets must have the same shape
        loss = torch.mean(torch.abs(outputs - targets))
        return loss
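A minimal check (my addition) that this hand-rolled loss really matches the built-in nn.L1Loss:
a, b = torch.randn(4, 1), torch.randn(4, 1)
assert torch.allclose(MAELoss()(a, b), nn.L1Loss()(a, b))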
fold_train_list = []
fold_val_list = []
fold_test_list = []
for i, (train_index, val_index) in enumerate(cv.split(FEATURE_TRAIN_DATA, GROUND_TRUTH)):
    # build a fresh model for each fold
    model = DNN()
    model = model.to(device)
    #criterion = nn.MSELoss()
    criterion = MAELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.0001)
    # train/validation split
    cv_train_dataset = Subset(dataset, train_index)
    cv_val_dataset = Subset(dataset, val_index)
    train_loader = DataLoader(cv_train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(cv_val_dataset, batch_size=batch_size, shuffle=False)
    # run
    print(f"***FOLD {i}")
    train_loss_list, val_loss_list = run(300, optimizer, criterion, device, train_loader, val_loader, model)
    # restore the best (early-stopping) weights before saving
    model.load_state_dict(torch.load('checkpoint_model.pth'))
    # Model Save
    ModelSavePath = 'model' + str(i) + '.pth'
    torch.save(model.state_dict(), ModelSavePath)
    # PLOT
    graph(train_loss_list, val_loss_list)
    # keep the final loss of each fold
    fold_train_list.append(train_loss_list[-1])
    fold_val_list.append(val_loss_list[-1])
    print("-----------------\n")
Output
The usual output preparation.
I'm doing "voting" here. While writing this article I started wondering: rather than summing the predictions and dividing by 5, would converting them to int and taking a majority vote be better? A sketch of that is included after the averaging loop below.
Give it a try if you're curious 🥰
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
y_pred = []
for i in range(5):
    # Model Load: reuse the model instance, swapping in each fold's weights
    ModelSavePath = 'model' + str(i) + '.pth'
    model.load_state_dict(torch.load(ModelSavePath))
    model.eval()
    y_pred_tmp = []
    with torch.no_grad():
        for (table_data,) in test_loader:
            table_data = table_data.to(device)
            outputs = model(table_data)
            # collect one (1,)-shaped tensor per sample
            y_pred_tmp.extend(outputs)
    y_pred.append(y_pred_tmp)

# average the five folds' predictions per sample
y_pred_bote = []
for i in range(len(y_pred[0])):
    tmp = 0
    for j in range(5):
        tmp += y_pred[j][i][0]
    y_pred_bote.append(tmp / 5)
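For the curious, here is a minimal sketch of the majority-vote variant mentioned above (my own code, not what this notebook submits): round each fold's prediction to an integer, then take the most common value per sample.
from collections import Counter

y_pred_vote = []
for i in range(len(y_pred[0])):
    votes = [int(round(float(y_pred[j][i][0]))) for j in range(5)]
    y_pred_vote.append(Counter(votes).most_common(1)[0][0])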
output = [["Id", "quality"]]
for i, test_id in enumerate(list(TEST_DATA["Id"])):
    tmp = []
    tmp.append(test_id)
    #tmp.append(y_pred_bote[i].item())
    # note: int() truncates toward zero; round() is another option
    tmp.append(int(y_pred_bote[i].item()))
    output.append(tmp)

output_path = "/kaggle/working/submission.csv"
with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(output)
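For reference, the same submission file can also be written with pandas (an equivalent alternative, not what this notebook uses):
submission = pd.DataFrame({
    "Id": TEST_DATA["Id"],
    "quality": [int(v.item()) for v in y_pred_bote],
})
submission.to_csv("/kaggle/working/submission.csv", index=False)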
Result
The result... a score of 0.37399. Far better than when I threw everything in.
The rank... 616/646 (top 96%).
Yeah, still pretty weak...
In Closing
Well, this time I just wanted to explain regression with a DNN, so I won't push for accuracy.
That said, this looks like a task where preprocessing really pays off.
As I said at the beginning, in deep learning nothing matters more than how you process the data.
If you get a good result, quietly share your code with me, okay? lol