
[SIGNATE] Image Labeling (10 Classes) ① Data Preparation

Last updated at 2020-12-17, posted at 2020-12-13

Overall structure

① Data preparation

② Image data augmentation

③ Transfer learning

Environment

Python 3.6.5
tensorflow 2.3.1

About the [SIGNATE] Image Labeling (10 Classes) competition

The goal is to build a model that assigns one of 10 labels to each image.

Number of training samples: 5000

Competition page:
https://signate.jp/competitions/133

Data preparation

Importing the libraries

python.py

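import numpy as np
import pandas as pd
# Image processing
from PIL import Image
# Listing the image files
import glob

# Splitting the data
from sklearn.model_selection import train_test_split
# One-hot encoding of the labels
from tensorflow.keras.utils import to_categorical
# Loading images and converting them to arrays
from tensorflow.keras.preprocessing.image import load_img, img_to_array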

Loading the label data

python.py

# Load the label data
train_Y = pd.read_csv('train_master.tsv', delimiter='\t')
# Drop the file name column
train_Y = train_Y.drop('file_name', axis=1)
# Convert the labels to one-hot vectors
Y = to_categorical(train_Y)
print(Y.shape)

Number of samples: 5000, number of classes: 10
(5000, 10)
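
To make the conversion above more concrete, here is a minimal NumPy sketch of one-hot encoding. It is only meant to illustrate the kind of array that `to_categorical` returns, not its actual implementation. The `one_hot` helper and the `sample_labels` values are made up for illustration and are not part of the competition data.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 array with a 1 at each label index."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set exactly one entry per row: row i, column labels[i]
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical label ids standing in for the label column of train_master.tsv
sample_labels = [0, 3, 9, 3]
encoded = one_hot(sample_labels, num_classes=10)

print(encoded.shape)           # (4, 10)
print(encoded.argmax(axis=1))  # [0 3 9 3]
```

Taking `argmax` along axis 1 recovers the original label ids, which is handy later when comparing predictions with the true labels.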

Loading the image data

python.py

# Collect the files in the train_images directory into a list
train_file = glob.glob('train_images/t*')
len(train_file)

Number of files: 5000
5000

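One problem came up here.
The files in the train_images directory were collected with glob, but glob does not return them in numerical order, so the list has to be sorted by file number. Otherwise the images would not line up with the rows of the label data.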

python.py

# Sort file paths by the numbers embedded in them
import re

def sortedStringList(array=None):
    # Use the integers found in each path as the sort key, e.g. train_10.jpg -> [10]
    def numeric_key(path):
        return [int(x) for x in re.findall(r"\d+", path)]
    return sorted(array or [], key=numeric_key)

sort_file = sortedStringList(train_file)
sort_file[:5]

List contents after sorting
['train_images/train_0.jpg',
 'train_images/train_1.jpg',
 'train_images/train_2.jpg',
 'train_images/train_3.jpg',
 'train_images/train_4.jpg']
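
As a small illustration of why the numeric sort matters, the sketch below compares plain string sorting with a natural sort on a few made-up paths. The `natural_key` and `check_alignment` helpers are hypothetical. The check also assumes that the `file_name` column of train_master.tsv holds names such as `train_0.jpg` in row order, which I have not verified here.

```python
import os
import re

def natural_key(path):
    """Split a path into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", path)]

# Hypothetical file names following the same pattern as the competition data
paths = ["train_images/train_10.jpg", "train_images/train_2.jpg",
         "train_images/train_0.jpg", "train_images/train_1.jpg"]

lexicographic = sorted(paths)             # plain string order: 0, 1, 10, 2
natural = sorted(paths, key=natural_key)  # numeric order: 0, 1, 2, 10
print(lexicographic)
print(natural)

def check_alignment(sorted_paths, file_names):
    """Return True when the i-th path has the i-th file name from the label table."""
    return [os.path.basename(p) for p in sorted_paths] == list(file_names)
```

If the label table does keep its `file_name` column around, a call like `check_alignment(sort_file, names)` before dropping the column is a cheap way to confirm that image i really belongs to label i.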

Image preprocessing

python.py

X = []
for path in sort_file:
    # Load the image file
    image = load_img(path)
    # Convert to a NumPy array and scale pixel values to [0, 1]
    image = img_to_array(image) / 255.0
    # Append to the list
    X.append(image)

# Convert the list to a single NumPy array
X_np = np.array(X)
print(X_np.shape)
print(X_np.dtype)

Number of samples: 5000, 96x96 color images
(5000, 96, 96, 3)
float32
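
The following sketch shows, on a randomly generated stand-in image, what the division by 255 does to the value range, and roughly how much memory the full float32 array takes. The random image is purely hypothetical; only the shape (5000, 96, 96, 3) and the float32 dtype come from the output above.

```python
import numpy as np

# Hypothetical stand-in for one decoded 96x96 RGB image (uint8, values 0-255)
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)

# Same scaling as in the preprocessing loop: float32 values divided by 255
scaled = raw.astype(np.float32) / 255.0
print(scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)

# Memory needed for the full stacked array of shape (5000, 96, 96, 3)
n_bytes = 5000 * 96 * 96 * 3 * np.dtype(np.float32).itemsize
print(n_bytes, round(n_bytes / 2**20, 1))  # 552960000 bytes, about 527.3 MiB
```

At about half a gigabyte the whole dataset still fits in memory on most machines, which is why loading everything into one array is workable here.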

Splitting the data

python.py

# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X_np, Y, test_size=0.3, random_state=0)
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train, test_size=0.3, random_state=0)

# Check the shapes
print("Y_train=", Y_train.shape, ", X_train=", X_train.shape)
print("Y_valid=", Y_valid.shape, ", X_valid=", X_valid.shape)
print("Y_test=", Y_test.shape, ", X_test=", X_test.shape)

Y_train= (2450, 10) , X_train= (2450, 96, 96, 3)
Y_valid= (1050, 10) , X_valid= (1050, 96, 96, 3)
Y_test= (1500, 10) , X_test= (1500, 96, 96, 3)
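
The shapes above follow from applying a 30% hold-out twice. Here is a short arithmetic sketch of that, assuming the held-out part is rounded up when the fraction does not divide evenly (to my understanding this matches how `train_test_split` sizes the test portion). `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 5000
n_rest, n_test = split_sizes(n_total, 0.3)   # first split: hold out the test set
n_train, n_valid = split_sizes(n_rest, 0.3)  # second split: carve validation out of the rest

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.0%} of all data)")
```

So the effective proportions are 49% training, 21% validation, and 30% test, rather than 70/30 for the training data as the first call alone might suggest.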

Wrap-up

Next time: data augmentation!
