
Building a Model Quickly with Open Interpreter, Even as a Kaggle Beginner

Posted at 2023-12-11



text = """
# Read csv data as dataset from '/kaggle/input/titanic/train.csv' as train data and '/kaggle/input/titanic/test.csv' as test data.
# Using the dataset, test the Decision Tree, Random Forest, XGBoost and an ensemble of the three models to see which is the most accurate. 
# The validation should be based on the average of the 4-fold cross-validation results.
# Save model as pickle file
# Save codes as notebook file
"""


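import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

# Load the Titanic data (paths as given in the prompt)
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')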

# Encode categorical variables
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex']) 

# Fill missing values
imputer = SimpleImputer(strategy='mean')
train_data['Age'] = imputer.fit_transform(train_data[['Age']])
test_data['Age'] = imputer.transform(test_data[['Age']])
train_data['Fare'] = imputer.fit_transform(train_data[['Fare']])
test_data['Fare'] = imputer.transform(test_data[['Fare']])

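# Split data into features and target.
# Keep only numeric columns: the remaining string columns (Name, Ticket, Cabin, Embarked)
# are not encoded above and would make the models below fail.
X_train = train_data.drop(['Survived'], axis=1).select_dtypes(include='number')
y_train = train_data['Survived']
X_test = test_data[X_train.columns]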

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
import pickle

# Define the models
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()
xgboost = xgb.XGBClassifier()

# Create the ensemble model
ensemble_model = VotingClassifier(estimators=[('dt', decision_tree), ('rf', random_forest), ('xgb', xgboost)], voting='hard')

# Perform cross-validation
decision_tree_scores = cross_val_score(decision_tree, X_train, y_train, cv=4)
random_forest_scores = cross_val_score(random_forest, X_train, y_train, cv=4)
xgboost_scores = cross_val_score(xgboost, X_train, y_train, cv=4)
ensemble_scores = cross_val_score(ensemble_model, X_train, y_train, cv=4)

# Calculate average accuracy
decision_tree_avg_accuracy = decision_tree_scores.mean()
random_forest_avg_accuracy = random_forest_scores.mean()
xgboost_avg_accuracy = xgboost_scores.mean()
ensemble_avg_accuracy = ensemble_scores.mean()

# Save the best model as a pickle file
candidates = {
    'decision_tree': (decision_tree, decision_tree_avg_accuracy),
    'random_forest': (random_forest, random_forest_avg_accuracy),
    'xgboost': (xgboost, xgboost_avg_accuracy),
    'ensemble': (ensemble_model, ensemble_avg_accuracy),
}
best_name = max(candidates, key=lambda name: candidates[name][1])
best_model = candidates[best_name][0]
# cross_val_score fits clones, so fit the chosen model on the full training data before saving
best_model.fit(X_train, y_train)
with open(f'{best_name}_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
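
For reference, the `voting='hard'` setting used for the ensemble means that each model casts one vote per sample and the majority label wins. The following is a minimal sketch of that idea in plain Python. The function name `hard_vote`, the tie-breaking rule, and the sample predictions are assumptions made up for illustration; this is not scikit-learn's implementation.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the smallest label (an assumption of this sketch).
    """
    voted = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical predictions from three models for four passengers
dt_pred = [1, 0, 1, 0]
rf_pred = [1, 1, 0, 0]
xgb_pred = [0, 1, 1, 0]
print(hard_vote([dt_pred, rf_pred, xgb_pred]))
```

With three models and two classes a tie cannot occur, so the tie-breaking rule only matters when the number of models is even.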

Trying Open Interpreter inside a Kaggle notebook

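With Open Interpreter, you can build a predictive model with just a few lines of code. This article shows how people who are not familiar with preprocessing or modeling can still run an analysis easily with Open Interpreter. Structurally, the code is about this simple: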

import interpreter
text = """
# (instructions written in natural language)
"""
result = interpreter.chat(text)

To simulate how an actual competition would go, I run Open Interpreter on the Titanic competition, which is often described as Kaggle's tutorial.


About Open Interpreter

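Open Interpreter is an open-source tool built on large language models such as GPT-3.5, GPT-4, and Code Llama. You describe what you want in natural language, as if in conversation, and it writes and runs the program, producing results for tasks such as web development and data analysis.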



!pip install open-interpreter
import interpreter
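# Whether to skip asking the user for permission before running code (True = run automatically)
interpreter.auto_run = True
# Model selection. Others such as claude-2 and command-nightly are also available
interpreter.model = "gpt-3.5-turbo"
# Enter your own OpenAI API key. You can issue one here → https://platform.openai.com/api-keys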
interpreter.api_key = "YOUR API KEY"

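Next come the instructions. The large language model interprets this text, then writes and runs the program. The instructions say: "Load the data, run ensemble learning with Decision Tree, Random Forest, and XGBoost, validate with k-fold cross-validation, and save the most accurate model and its code."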

# Example prompt
text = """
# Read csv data as dataset from '/kaggle/input/titanic/train.csv' as train data and '/kaggle/input/titanic/test.csv' as test data.
# Using the dataset, test the Decision Tree, Random Forest, XGBoost and an ensemble of the three models to see which is the most accurate. 
# The validation should be based on the average of the k-fold cross-validation results.
# Save model as pickle file
# Save codes as notebook file
"""

# Receive the result in result
result = interpreter.chat(text)
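
To make "the average of the k-fold cross-validation results" in the prompt concrete, here is a minimal sketch in plain Python. The helper names (`k_fold_indices`, `cross_val_mean`, `majority_baseline_accuracy`) and the toy labels are assumptions for illustration only. The sketch uses contiguous, unshuffled folds, whereas scikit-learn's `cross_val_score` uses stratified folds for classifiers when given an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mean(score_fn, n_samples, k):
    """Average score_fn(train_idx, valid_idx) over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, valid_idx in enumerate(folds):
        # Every fold except the i-th one is used for training
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train_idx, valid_idx))
    return sum(scores) / len(scores)


# Toy stand-in for a model: always predict the majority label of the training rows
labels = [0, 0, 1, 0, 1, 1, 0, 0]


def majority_baseline_accuracy(train_idx, valid_idx):
    train_labels = [labels[j] for j in train_idx]
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    hits = sum(1 for j in valid_idx if labels[j] == guess)
    return hits / len(valid_idx)


mean_accuracy = cross_val_mean(majority_baseline_accuracy, len(labels), k=4)
print(mean_accuracy)
```

Each sample is used for validation exactly once, so the averaged score is less sensitive to one lucky or unlucky split than a single train/validation split.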




  1. Re-run as is, without changing anything
  2. Describe the part that raises the error in detail, then re-run
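
The two options above could be combined into a small retry loop, sketched below under stated assumptions. `run_with_retry` and `run_prompt` are hypothetical names: `run_prompt` stands in for whatever sends the prompt to the assistant and is assumed to raise an exception on failure. Nothing here is part of Open Interpreter's API.

```python
def run_with_retry(run_prompt, prompt, max_attempts=3):
    """Call run_prompt(prompt), retrying on failure.

    The first retry re-sends the prompt unchanged (option 1); later retries
    append the error message to the prompt (option 2).
    """
    current = prompt
    for attempt in range(1, max_attempts + 1):
        try:
            return run_prompt(current)
        except Exception as exc:
            if attempt == max_attempts:
                raise
            if attempt >= 2:
                # Option 2: describe the failing part and ask for a fix
                current = (
                    f"{prompt}\n"
                    f"# The previous attempt failed with: {exc}\n"
                    f"# Please fix this error."
                )


# Hypothetical runner that only succeeds once the error is described
calls = []


def flaky_runner(prompt_text):
    calls.append(prompt_text)
    if "failed with" not in prompt_text:
        raise ValueError("could not convert string to float")
    return "ok"


result = run_with_retry(flaky_runner, "# Train a model")
print(result, len(calls))
```

Trying the unchanged prompt first is cheap because generated code can differ between runs; adding the error text only afterwards keeps the prompt short when a plain re-run is enough.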




