LoginSignup
2

More than 3 years have passed since last update.

tldr

KggleのZoo Animal ClassificationTabular Zoo Animal Classification - Data Every Day #032に沿ってやっていきます。

実行環境はGoogle Colaboratorです。

インポート

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.preprocessing as sp
from sklearn.model_selection import train_test_split

import tensorflow as tf

データのダウンロード

Google Driveをマウントします。

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

KaggleのAPIクライアントを初期化し、認証します。
認証情報はGoogle Drive内(/content/drive/My Drive/Colab Notebooks/Kaggle)にkaggle.jsonとして置いてあります。

import os
kaggle_path = "/content/drive/My Drive/Colab Notebooks/Kaggle"
os.environ['KAGGLE_CONFIG_DIR'] = kaggle_path

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate() 

Kaggle APIを使ってデータをダウンロードします。

dataset_id = 'uciml/zoo-animal-classification'
dataset = api.dataset_list_files(dataset_id)
file_name = dataset.files[0].name
file_path = os.path.join(api.get_default_download_dir(), file_name)
file_path
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.10 / client 1.5.9)





'/content/zoo.csv'
api.dataset_download_file(dataset_id, file_name, force=True, quiet=False)
100%|██████████| 4.27k/4.27k [00:00<00:00, 1.04MB/s]

Downloading zoo.csv to /content









True

データの読み込み

Padasを使ってダウンロードしてきたCSVファイルを読み込みます。

data = pd.read_csv(file_path)
data
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize class_type
0 aardvark 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
1 antelope 1 0 0 1 0 0 0 1 1 1 0 0 4 1 0 1 1
2 bass 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 4
3 bear 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
4 boar 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 1 0 0 2 1 0 1 1
97 wasp 1 0 1 0 1 0 0 0 0 1 1 0 6 0 0 0 6
98 wolf 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
99 worm 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 7
100 wren 0 1 1 0 1 0 0 0 1 1 0 0 2 1 0 0 2

101 rows × 18 columns

下準備

不要な列の削除

data = data.drop('animal_name', axis=1)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   hair        101 non-null    int64
 1   feathers    101 non-null    int64
 2   eggs        101 non-null    int64
 3   milk        101 non-null    int64
 4   airborne    101 non-null    int64
 5   aquatic     101 non-null    int64
 6   predator    101 non-null    int64
 7   toothed     101 non-null    int64
 8   backbone    101 non-null    int64
 9   breathes    101 non-null    int64
 10  venomous    101 non-null    int64
 11  fins        101 non-null    int64
 12  legs        101 non-null    int64
 13  tail        101 non-null    int64
 14  domestic    101 non-null    int64
 15  catsize     101 non-null    int64
 16  class_type  101 non-null    int64
dtypes: int64(17)
memory usage: 13.5 KB

データはとてもきれいです。ほぼやることがない。

分割とスケーリング

y = data['class_type']
X = data.drop('class_type', axis=1)
scaler = sp.MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X
hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize
0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.50 0.0 0.0 1.0
1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.50 1.0 0.0 1.0
2 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.00 1.0 0.0 0.0
3 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.50 0.0 0.0 1.0
4 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.50 1.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.25 1.0 0.0 1.0
97 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.75 0.0 0.0 0.0
98 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.50 1.0 0.0 1.0
99 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.00 0.0 0.0 0.0
100 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.25 1.0 0.0 0.0

101 rows × 16 columns

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

トレーニング

ロジスティック回帰

from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

SVM

from sklearn.svm import SVC

svm_model = SVC(C=1.0)
svm_model.fit(X_train, y_train)
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

ニューラルネット

from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(64, 64))
nn_model.fit(X_train, y_train)

/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:571: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)





MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(64, 64), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

結果

log_acc = log_model.score(X_test, y_test)
svm_acc = svm_model.score(X_test, y_test)
nn_acc = nn_model.score(X_test, y_test)

print('Accuracy Results\n' + '*'*16)
print('Logistic Model:', log_acc)
print('     SVM Model:', svm_acc)
print('      NN Model:', nn_acc)
Accuracy Results
****************
Logistic Model: 0.9354838709677419
     SVM Model: 0.9032258064516129
      NN Model: 1.0

結果としてはニューラルネットが最も高い精度になりました。
ただ、データ自体がとても少ないのでそこは注意が必要だと思いました。

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
2