More than 5 years have passed since last update.

データ分析コンペのKaggleに登録してみた

Posted at 2017-01-09

備忘録変わりに書いていきます。

Kaggle のアカウント登録

Kaggle のサイトから、アカウント登録。
Facebook や Google のアカウントでも良いし、メールアドレスから登録しても良い。

参加するコンペの選択

Competition から選ぶ。

今回は初めてなので、Tutorial の Titanic にした。
乗船者のデータから、乗船者が救助されたかどうかを予測する問題。

データのダウンロード

必要なデータをダウンロードする。
Kaggle ダッシュボード画面左にある Data から。

以下のファイルをダウンロードできる。今回使うのはtrain.csvとtest.csvの２つ。

分析～ Submit までの流れ

Titanic のページ末尾の Interactive Tutorials にしたがって、DataQuest に倣った。
DataQuest では、Titanic を例題に取り上げ、ブラウザ上で Python のコードを書いてクイズに答えながら、分析の進め方から Submission 用のデータの作り方などを知ることができる。
テストデータを使って、予測結果が得られたら、submit する。
Kaggle ダッシュボード画面左にある Make a submission から。
結果を出力した csv ファイルをアップロードしておしまい。

DataQuestで作成したコード

線形回帰とロジスティック回帰の２つのモデルを作成。

titanic.py

# -*- coding: utf-8 -*-
import os
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold

# 学習データの読み込み
titanic = pd.read_csv(u'path/to/train.csv')

### データ処理 ###
# replace na with median value in age
titanic.Age = titanic.Age.fillna(titanic.Age.median())

# replace letters to numerics in Embarked Column
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

# replace letters to numeric in Embarked Column
titanic.Embarked = titanic.Embarked.fillna('S')
titanic.loc[titanic.Embarked == 'S', 'Embarked'] = 0
titanic.loc[titanic.Embarked == 'C', 'Embarked'] = 1
titanic.loc[titanic.Embarked == 'Q', 'Embarked'] = 2

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It return the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using the train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

# Calculate accuracy
positive = np.sum(np.equal(np.array(titanic.Survived), predictions))
accuracy = positive/float(len(titanic.Survived))
print(accuracy)

# テストデータの読み込みと前処理
titanic_test = pandas.read_csv("path/to/test.csv")

titanic_test.Age = titanic_test.Age.fillna(titanic.Age.median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

titanic_test.Embarked = titanic_test.Embarked.fillna('S')
titanic_test.loc[titanic_test.Embarked == 'S', 'Embarked'] = 0
titanic_test.loc[titanic_test.Embarked == 'C', 'Embarked'] = 1
titanic_test.loc[titanic_test.Embarked == 'Q', 'Embarked'] = 2

titanic_test.Fare = titanic_test.Fare.fillna(titanic.Fare.median())

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up