
Studying by Reading the Digit Recognizer Sample (A Beginner's Approach to Classification)

Kaggle

Posted at 2018-02-18

Reading through Notebook 1 (A beginner's approach to classification), which is introduced in the Digit Recognizer Tutorial, to study machine learning.

Introduction

This is not the best method for classifying digits
It is aimed at people who don't know where to start

import pandas as pd
import matplotlib.pyplot as plt, matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
from sklearn import svm
%matplotlib inline

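pandas: a library for data manipulation and analysis
matplotlib: a library for plotting graphs and other figures
scikit-learn: a library for machine learning
svm
- Support Vector Machines
- Supervised learning models used for classification, regression, and outlier detection
- An article that explains them -> 機械学習入門～ハードマージンSVM編～ (Introduction to Machine Learning: Hard-Margin SVM)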
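As a rough illustration of what an SVM classifier does, here is a minimal sketch on made-up 2D points (the data and the choice of a linear kernel are assumptions for illustration, not part of the notebook). The two clusters are far apart, so the classifier separates them easily.

```python
from sklearn import svm

# Hypothetical toy data: two well-separated clusters in 2D
X = [[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel is enough for linearly separable points
toy_clf = svm.SVC(kernel="linear")
toy_clf.fit(X, y)

# Predict the class of one new point near each cluster
pred = toy_clf.predict([[0.5, 0.5], [4.5, 4.5]])
print(pred)
```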

Loading the data

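Use pandas read_csv to load train.csv into a DataFrame
- A DataFrame appears to be a data structure for analyzing 2-dimensional data
- A matrix with labels
Separate the images from the labels for training
Split the data into training and test sets with train_test_split to evaluate the model
To save time, only 5000 rows are used
Adjust the number yourself depending on how the model behaves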

labeled_images = pd.read_csv('../input/train.csv')
images = labeled_images.iloc[0:5000,1:]
labels = labeled_images.iloc[0:5000,:1]
train_images, test_images,train_labels, test_labels = train_test_split(images, labels, train_size=0.8, random_state=0)

Passing train_size produces a warning recommending test_size instead, so rewrite the argument accordingly
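A minimal sketch of the same split written with test_size, run on a small made-up DataFrame instead of train.csv. The column names and the value test_size=0.2 are assumptions for illustration; the article does not say which value was actually used after rewriting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv: 10 rows, first column is the label
data = pd.DataFrame(np.arange(40).reshape(10, 4),
                    columns=["label", "p0", "p1", "p2"])
images = data.iloc[:, 1:]   # every column except the label
labels = data.iloc[:, :1]   # only the label column

# test_size=0.2 holds out 20% of the rows for evaluation
train_images, test_images, train_labels, test_labels = train_test_split(
    images, labels, test_size=0.2, random_state=0)

print(len(train_images), len(test_images))
```

Images and labels are shuffled together, so each row of train_images still lines up with the same row of train_labels.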

Viewing an Image

Use a numpy array and reshape to convert the 1-dimensional image data to 2 dimensions
- An article about reshape
Plot the image and its label with matplotlib

i = 1
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same ndarray
img = train_images.iloc[i].to_numpy()
img = img.reshape((28, 28))
plt.imshow(img, cmap='gray')
plt.title(train_labels.iloc[i, 0])

Other samples can be checked by changing the value of i
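To make the reshape step concrete, here is a small sketch with a made-up flat array standing in for one row of pixel data. Each consecutive run of 28 values becomes one row of the 28x28 image.

```python
import numpy as np

# Hypothetical flat "image": 784 values, like one row of pixel columns
flat = np.arange(784)

# reshape fills the 2D array row by row (C order)
img = flat.reshape((28, 28))

print(img.shape)   # (28, 28)
print(img[1, 0])   # first value of the second row
```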

Examining the Pixel Values

The images are not black and white (0, 1) but grayscale (0-255)
Plot the image's pixel values as a histogram

plt.hist(train_images.iloc[i])
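plt.hist groups the values into 10 equal-width bins by default. The same counts can be inspected as numbers with np.histogram, sketched here on a short made-up pixel array rather than a real image row.

```python
import numpy as np

# Hypothetical pixel values: mostly background (0), a few gray and white
pixels = np.array([0] * 6 + [128] * 2 + [255] * 2)

# 10 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(pixels, bins=10)

print(counts)
print(edges)
```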

Training our model

Create a vector classifier with sklearn.svm
Pass the training data to the fit method to train the model
Pass the test data to the score method to evaluate the model
The accuracy is returned as a float between 0 and 1

clf = svm.SVC()
clf.fit(train_images, train_labels.values.ravel())
clf.score(test_images,test_labels)

The result was 0.091
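For a classifier, score returns the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, using made-up labels and predictions:

```python
import numpy as np

# Hypothetical true labels and predictions for five test images
true_labels = np.array([3, 7, 1, 0, 9])
predicted = np.array([3, 7, 2, 0, 4])

# Comparing element-wise gives booleans; their mean is the accuracy
accuracy = np.mean(predicted == true_labels)

print(accuracy)
```

With 10 digit classes, a score near 0.1 is about what random guessing would give.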

How did our model do?

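The accuracy should be terrible, around 10%
Here we improve it starting from a simple approach
Simplify the images by using true black and white
Set every pixel value other than 0 to 1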

test_images[test_images>0]=1
train_images[train_images>0]=1

# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
img = train_images.iloc[i].to_numpy().reshape((28, 28))
plt.imshow(img, cmap='binary')
plt.title(train_labels.iloc[i, 0])

The histogram shows that the values are now only 0 and 1

plt.hist(train_images.iloc[i])
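The binarization can also be written as a comparison followed by a type conversion, which returns a new DataFrame instead of assigning in place. This is an alternative sketch on made-up pixel columns, not the notebook's own code. It may be handy when in-place assignment on a slice triggers a pandas SettingWithCopyWarning.

```python
import pandas as pd

# Hypothetical pixel columns with grayscale values
pixels = pd.DataFrame({"pixel0": [0, 12, 255], "pixel1": [200, 0, 0]})

# True where the pixel is non-zero, then converted to 0/1 integers
binary = (pixels > 0).astype(int)

print(binary)
```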

Retraining our model

Retrain using the data converted to black and white

clf = svm.SVC()
clf.fit(train_images, train_labels.values.ravel())
clf.score(test_images,test_labels)

The result was 0.86975
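One plausible reason for the jump from 0.091 to 0.86975, offered as an assumption rather than something the notebook states: SVC uses an RBF kernel by default, and scikit-learn releases before 0.22 set gamma to 1 / n_features. With raw 0-255 pixels the squared distance between two images is so large that the kernel value collapses to 0, so every image looks unrelated to every other one. The sketch below compares the kernel value for two random, made-up 784-pixel vectors before and after binarization.

```python
import numpy as np

def rbf_kernel_value(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Two hypothetical images with random grayscale pixels
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=784).astype(float)
b = rng.integers(0, 256, size=784).astype(float)

gamma = 1.0 / 784  # assumed old default: 1 / n_features

raw_kernel = rbf_kernel_value(a, b, gamma)
binary_kernel = rbf_kernel_value((a > 0).astype(float),
                                 (b > 0).astype(float), gamma)

print(raw_kernel, binary_kernel)
```

Dividing the pixel values by 255 would shrink the distances in a similar way while keeping the gray levels, though the notebook itself only tries the black-and-white conversion.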

Labelling the test data

To submit to the competition, predict on test.csv and write the results to results.csv

test_data=pd.read_csv('../input/test.csv')
test_data[test_data>0]=1
results=clf.predict(test_data[0:5000])

df = pd.DataFrame(results)
df.index.name='ImageId'
df.index+=1
df.columns=['Label']
df.to_csv('results.csv', header=True)
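To check the shape of the output without running the model, the same steps can be sketched on a few made-up predictions. Calling to_csv without a path returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical predictions for three test images
sample_results = [2, 0, 9]

sample_df = pd.DataFrame({"Label": sample_results})
sample_df.index = sample_df.index + 1   # ImageId starts at 1, not 0
sample_df.index.name = "ImageId"

# With no path argument, to_csv returns the CSV as a string
csv_text = sample_df.to_csv()
print(csv_text)
```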
