Building a Model Quickly with Open Interpreter, Even as a Kaggle Beginner

Posted at 2023-12-11, last updated at 2023-12-11

The code this article generates automatically

The prompt below is all it takes to generate the code that follows.

text = """
# Read csv data as dataset from '/kaggle/input/titanic/train.csv' as train data and '/kaggle/input/titanic/test.csv' as test data.
# Using the dataset, test the Decision Tree, Random Forest, XGBoost and an ensemble of the three models to see which is the most accurate. 
# The validation should be based on the average of the 4-fold cross-validation results.
# Save model as pickle file
# Save codes as notebook file
"""

Generated code

from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Encode categorical variables
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex']) 

# Fill missing values
imputer = SimpleImputer(strategy='mean')
train_data['Age'] = imputer.fit_transform(train_data[['Age']])
test_data['Age'] = imputer.transform(test_data[['Age']])
train_data['Fare'] = imputer.fit_transform(train_data[['Fare']])
test_data['Fare'] = imputer.transform(test_data[['Fare']])

# Split data into features and target
X_train = train_data.drop(['Survived'], axis=1)
y_train = train_data['Survived']
X_test = test_data 



from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
import pickle

# Define the models
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()
xgboost = xgb.XGBClassifier()

# Create the ensemble model
ensemble_model = VotingClassifier(estimators=[('dt', decision_tree), ('rf', random_forest), ('xgb', xgboost)], voting='hard')

# Perform cross-validation
decision_tree_scores = cross_val_score(decision_tree, X_train, y_train, cv=4)
random_forest_scores = cross_val_score(random_forest, X_train, y_train, cv=4)
xgboost_scores = cross_val_score(xgboost, X_train, y_train, cv=4)
ensemble_scores = cross_val_score(ensemble_model, X_train, y_train, cv=4)

# Calculate average accuracy
decision_tree_avg_accuracy = decision_tree_scores.mean()
random_forest_avg_accuracy = random_forest_scores.mean()
xgboost_avg_accuracy = xgboost_scores.mean()
ensemble_avg_accuracy = ensemble_scores.mean()

# Save the best model as a pickle file
best_model = max(decision_tree_avg_accuracy, random_forest_avg_accuracy, xgboost_avg_accuracy, ensemble_avg_accuracy)
if best_model == decision_tree_avg_accuracy:
    pickle.dump(decision_tree, open('decision_tree_model.pkl', 'wb'))
elif best_model == random_forest_avg_accuracy:
    pickle.dump(random_forest, open('random_forest_model.pkl', 'wb'))
elif best_model == xgboost_avg_accuracy:
    pickle.dump(xgboost, open('xgboost_model.pkl', 'wb'))
else:
    pickle.dump(ensemble_model, open('ensemble_model.pkl', 'wb'))

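Trying Open Interpreter inside a Kaggle notebook

With Open Interpreter, a few lines of code are enough to get a predictive model. This article shows how people who are not familiar with preprocessing or modeling can still run an analysis easily with Open Interpreter. Structurally, the code is as simple as this: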

import interpreter
text = """
Analyze the data
"""
result = interpreter.chat(text)

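Assuming the workflow of a real competition, we will run Open Interpreter on the Titanic competition, often described as Kaggle's tutorial.

Create a notebook from the competition page so that you start with access to the Titanic data, that is, with train.csv and test.csv available under Input.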

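About Open Interpreter

Open Interpreter is an open-source tool built on large language models such as GPT-3.5, GPT-4, and Code Llama. You describe what you want in natural language, as if in conversation, and it writes and runs the program, returning results for tasks such as web development and data analysis.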

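Coding with a simple prompt

First, the setup: installing and importing the package.

# Install the required package
!pip install open-interpreter
import interpreter
# Whether to run generated code without asking the user for confirmation (True = run automatically)
interpreter.auto_run = True
# Choose the model; other options include claude-2 and command-nightly
interpreter.model = "gpt-3.5-turbo"
# Your own OpenAI API key; issue one at https://platform.openai.com/api-keys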
interpreter.api_key = "YOUR API KEY"
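
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been stored in an environment variable beforehand; the helper `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of the original setup:

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Both this helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage in the setup cell above (commented out so this block runs without a key):
# interpreter.api_key = load_api_key()
```

On Kaggle, the Secrets add-on is another place to keep the key out of the notebook source.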

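Next comes the instruction itself. The large language model interprets this text, then writes and runs the program. In short, it says: load the data; compare a decision tree, a random forest, XGBoost, and an ensemble of the three using k-fold cross-validation; then save the most accurate model and the code.

# Example prompt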
text = """
# Read csv data as dataset from '/kaggle/input/titanic/train.csv' as train data and '/kaggle/input/titanic/test.csv' as test data.
# Using the dataset, test the Decision Tree, Random Forest, XGBoost and an ensemble of the three models to see which is the most accurate. 
# The validation should be based on the average of the k-fold cross-validation results.
# Save model as pickle file
# Save codes as notebook file
"""

# The return value is stored in result
result = interpreter.chat(text)

Let's walk through how the generated code corresponds to the prompt above.

Walkthrough

Loading the data

# Read csv data as dataset from '/kaggle/input/titanic/train.csv' as train data and '/kaggle/input/titanic/test.csv' as test data.
It is better to state the directory explicitly, along with which file is the training data and which is the test data.

Generated code

import pandas as pd

# Read train data from CSV
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Read test data from CSV
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

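# Read Titanic dataset
A vaguer prompt like this one can sometimes load the data too, but Open Interpreter is not actually looking at the train.csv and test.csv attached to the notebook you are working in. To use the approach across competitions in general, specify the directory.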
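
To find the exact paths to put in the prompt, you can list what is attached to the notebook. A minimal sketch, assuming the standard Kaggle input root `/kaggle/input`; the helper name `list_input_files` is hypothetical:

```python
import os


def list_input_files(root: str = "/kaggle/input") -> list[str]:
    """Return the full path of every file found under root, sorted."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# Prints nothing if the directory does not exist (for example, outside Kaggle).
for path in list_input_files():
    print(path)
```

For the Titanic competition, the list should include the two paths used in the prompt, `/kaggle/input/titanic/train.csv` and `/kaggle/input/titanic/test.csv`.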

Training and validation

# Using the dataset, test the Decision Tree, Random Forest, XGBoost and an ensemble of the three models to see which is the most accurate.
# The validation should be based on the average of the k-fold cross-validation results.
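This instructs it to pick the most accurate model among a decision tree, a random forest, XGBoost, and their ensemble. LightGBM, CatBoost, and others work as well. It also helps to state the validation method explicitly.

Generated code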

from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Encode categorical variables
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex']) 

# Fill missing values
imputer = SimpleImputer(strategy='mean')
train_data['Age'] = imputer.fit_transform(train_data[['Age']])
test_data['Age'] = imputer.transform(test_data[['Age']])
train_data['Fare'] = imputer.fit_transform(train_data[['Fare']])
test_data['Fare'] = imputer.transform(test_data[['Fare']])

# Split data into features and target
X_train = train_data.drop(['Survived'], axis=1)
y_train = train_data['Survived']
X_test = test_data 


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier

# Define the models
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()
xgboost = xgb.XGBClassifier()

# Create the ensemble model
ensemble_model = VotingClassifier(estimators=[('dt', decision_tree), ('rf', random_forest), ('xgb', xgboost)], voting='hard')

# Perform cross-validation
decision_tree_scores = cross_val_score(decision_tree, X_train, y_train, cv=4)
random_forest_scores = cross_val_score(random_forest, X_train, y_train, cv=4)
xgboost_scores = cross_val_score(xgboost, X_train, y_train, cv=4)
ensemble_scores = cross_val_score(ensemble_model, X_train, y_train, cv=4)

# Calculate average accuracy
decision_tree_avg_accuracy = decision_tree_scores.mean()
random_forest_avg_accuracy = random_forest_scores.mean()
xgboost_avg_accuracy = xgboost_scores.mean()
ensemble_avg_accuracy = ensemble_scores.mean()
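
For readers new to cross-validation, the following sketch shows roughly what "the average of the 4-fold results" means. It is a simplification under stated assumptions: it uses plain contiguous folds (scikit-learn's `cross_val_score` with `cv=4` uses stratified folds for classifiers), a majority-class baseline instead of a real model, and made-up labels. All function names here are hypothetical.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def majority_class(labels):
    """A stand-in 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)


def cross_val_accuracy(y, k=4):
    """Average validation accuracy of the majority-class baseline over k folds."""
    scores = []
    for train, val in k_fold_indices(len(y), k):
        prediction = majority_class([y[i] for i in train])
        hits = sum(1 for i in val if y[i] == prediction)
        scores.append(hits / len(val))
    return mean(scores)


# Made-up labels for illustration only (not Titanic data).
y_demo = [0, 0, 1, 0, 0, 1, 0, 0]
print(cross_val_accuracy(y_demo, k=4))
```

Each sample is used for validation exactly once, and the reported score is the mean over the k held-out folds, which is what `decision_tree_scores.mean()` and its siblings compute in the generated code.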

# Using the dataset, make one accurate model.
# The validation method is k-fold cross-validation.
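With a vague prompt like this one, generation sometimes goes poorly: it may try only a single model, which hurts performance, or it may get stuck fixing errors over and over. Write down concretely every element you can decide in advance.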
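
One way to keep the fixed elements explicit is to assemble the prompt from parameters. The sketch below is only an illustration; `build_prompt` and its arguments are hypothetical names, and the wording of each line follows the prompt used in this article:

```python
def build_prompt(train_path, test_path, models, n_folds, extra=()):
    """Compose a prompt with one '#'-prefixed instruction per line."""
    if len(models) == 1:
        model_list = models[0]
    else:
        model_list = ", ".join(models[:-1]) + " and " + models[-1]
    lines = [
        f"Read csv data as dataset from '{train_path}' as train data and '{test_path}' as test data.",
        f"Using the dataset, test {model_list} to see which is the most accurate.",
        f"The validation should be based on the average of the {n_folds}-fold cross-validation results.",
        *extra,  # optional additional instructions, inserted before the save steps
        "Save model as pickle file",
        "Save codes as notebook file",
    ]
    return "\n" + "\n".join(f"# {line}" for line in lines) + "\n"


text = build_prompt(
    "/kaggle/input/titanic/train.csv",
    "/kaggle/input/titanic/test.csv",
    ["the Decision Tree", "Random Forest", "XGBoost", "an ensemble of the three models"],
    n_folds=4,
)
print(text)
```

The `extra` argument is a place for further constraints. For example, an instruction to drop text columns such as Name, Ticket, and Cabin before training could go there; that particular instruction is my assumption, not something the original prompt contains.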

Saving

# Save model as pickle file
# Save codes as notebook file
Saving the final code as a .py or .ipynb file lets you take it from there and improve the model yourself.

Generated code

import pickle

# Save the best model as a pickle file
best_model = max(decision_tree_avg_accuracy, random_forest_avg_accuracy, xgboost_avg_accuracy, ensemble_avg_accuracy)
if best_model == decision_tree_avg_accuracy:
    pickle.dump(decision_tree, open('decision_tree_model.pkl', 'wb'))
elif best_model == random_forest_avg_accuracy:
    pickle.dump(random_forest, open('random_forest_model.pkl', 'wb'))
elif best_model == xgboost_avg_accuracy:
    pickle.dump(xgboost, open('xgboost_model.pkl', 'wb'))
else:
    pickle.dump(ensemble_model, open('ensemble_model.pkl', 'wb'))

# Save the code as a notebook file
code = '''
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
import pickle

# Read train data from CSV
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')

# Read test data from CSV
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

# Encode categorical variables
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex'])

# Fill missing values
imputer = SimpleImputer(strategy='mean')
train_data['Age'] = imputer.fit_transform(train_data[['Age']])
test_data['Age'] = imputer.transform(test_data[['Age']])
train_data['Fare'] = imputer.fit_transform(train_data[['Fare']])
test_data['Fare'] = imputer.transform(test_data[['Fare']])

# Split data into features and target
X_train = train_data.drop(['Survived'], axis=1)
y_train = train_data['Survived']
X_test = test_data

# Define the models
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()
xgboost = xgb.XGBClassifier()

# Create the ensemble model
ensemble_model = VotingClassifier(estimators=[('dt', decision_tree), ('rf', random_forest), ('xgb', xgboost)], voting='hard')

# Perform cross-validation
decision_tree_scores = cross_val_score(decision_tree, X_train, y_train, cv=4)
random_forest_scores = cross_val_score(random_forest, X_train, y_train, cv=4)
xgboost_scores = cross_val_score(xgboost, X_train, y_train, cv=4)
ensemble_scores = cross_val_score(ensemble_model, X_train, y_train, cv=4)

# Calculate average accuracy
decision_tree_avg_accuracy = decision_tree_scores.mean()
random_forest_avg_accuracy = random_forest_scores.mean()
xgboost_avg_accuracy = xgboost_scores.mean()
ensemble_avg_accuracy = ensemble_scores.mean()

# Save the best model as a pickle file
best_model = max(decision_tree_avg_accuracy, random_forest_avg_accuracy, xgboost_avg_accuracy, ensemble_avg_accuracy)
if best_model == decision_tree_avg_accuracy:
    pickle.dump(decision_tree, open('decision_tree_model.pkl', 'wb'))
elif best_model == random_forest_avg_accuracy:
    pickle.dump(random_forest, open('random_forest_model.pkl', 'wb'))
elif best_model == xgboost_avg_accuracy:
    pickle.dump(xgboost, open('xgboost_model.pkl', 'wb'))
else:
    pickle.dump(ensemble_model, open('ensemble_model.pkl', 'wb')) 
'''
with open('titanic_modeling.ipynb', 'w') as f:
    f.write(code)
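
One caveat about the last two lines of the generated code: an .ipynb file is JSON, so writing raw Python source into it will most likely produce a file that Jupyter cannot open as a notebook. A minimal sketch of a workaround using only the standard library, assuming the nbformat 4 structure with a single code cell; the helper names and the demo file name are hypothetical:

```python
import json


def code_to_notebook(code: str) -> dict:
    """Wrap a source string in a minimal nbformat-4 notebook with one code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": code.strip().splitlines(keepends=True),
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def save_notebook(code: str, path: str) -> None:
    """Write the wrapped source to path as notebook JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(code_to_notebook(code), f, indent=1)


# Placeholder source; in the article's flow this would be the `code` string above.
save_notebook("import pandas as pd\nprint('hello')\n", "titanic_modeling_demo.ipynb")
```

The `nbformat` package provides helpers for the same job if you would rather not build the dictionary by hand. Saving the source as a .py file is the simpler option when a notebook is not required.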

The saved files are recorded under Output.
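
Two details of the generated model-saving code are worth checking before relying on the pickle. As far as I know, `cross_val_score` fits clones of each estimator, so the objects pickled above have not been trained themselves. Comparing floats in an if/elif chain also makes ties awkward to reason about. The sketch below shows one possible rewrite; the function names and the example scores are made up for illustration:

```python
import pickle


def pick_best(scores):
    """Return the name with the highest mean score (the first one wins on ties)."""
    return max(scores, key=scores.get)


def fit_and_save(model, X, y, path):
    """Fit the model on the full training data, then pickle it."""
    model.fit(X, y)
    with open(path, "wb") as f:
        pickle.dump(model, f)


# Made-up scores for illustration only; in the article's flow these would be
# decision_tree_avg_accuracy, random_forest_avg_accuracy, and so on.
example_scores = {
    "decision_tree": 0.70,
    "random_forest": 0.80,
    "xgboost": 0.78,
    "ensemble": 0.79,
}
best_name = pick_best(example_scores)
print(best_name)
```

With the article's variables, you would build a dictionary such as `models = {"decision_tree": decision_tree, "random_forest": random_forest, "xgboost": xgboost, "ensemble": ensemble_model}` and call `fit_and_save(models[best_name], X_train, y_train, f"{best_name}_model.pkl")`.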

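If it doesn't work, that's fine

Even with the same prompt, the approach differs slightly from run to run, and sometimes it never reaches a final result. If the model is not built successfully, retry in one of the following ways.

Rerun as-is, without changing anything
Describe the step where the error occurs in more detail and rerun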
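
The first option, rerunning unchanged, can be wrapped in a small loop with an upper bound on attempts. This is a sketch under assumptions: `run_until_output` is a hypothetical helper, `chat` stands for any callable such as `interpreter.chat`, and success is judged only by whether an expected output file exists.

```python
import os


def run_until_output(chat, prompt, expected_file, max_attempts=3):
    """Call chat(prompt) until expected_file exists or attempts run out.

    Returns the number of attempts used, or None if the file never appeared.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            chat(prompt)
        except Exception as error:  # keep going; the next attempt may succeed
            print(f"attempt {attempt} failed: {error}")
        if os.path.exists(expected_file):
            return attempt
    return None


# Usage with the article's variables (commented out; it needs an API key):
# run_until_output(interpreter.chat, text, "titanic_modeling.ipynb")
```

The file name to wait for depends on what your prompt asks Open Interpreter to save, so adjust it to match. Capping the attempts keeps a run that fails repeatedly from consuming API calls indefinitely.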

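In my experience it interprets Japanese prompts quite well, but English is more stable.
When tackling a competition, a good approach seems to be to take a general-purpose piece of code that was generated successfully and improve the model from there.
Errors can also occur during generation, but Open Interpreter retries automatically.
For example, in one run it tried to use the KFold class to perform k-fold cross-validation, hit an import error during execution, and rewrote the code.

If errors keep repeating and the run never finishes, it may be better to stop the kernel once.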
