
DQN with TensorFlow: An AI Bug in a Miniature Garden

Posted at 2016-04-29

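The article "ChainerでやってみるDeep Q Learning - 立ち上げ編" (Trying Deep Q Learning with Chainer: Getting Started) built something that really looks like artificial intelligence, so I decided to try imitating it.
That said, I have never used wxPython, and copying it exactly looked difficult, so I went with something simpler. It is just for fun, after all.
As usual, I do not really understand the specialist details and am writing by feel, so there may be major misunderstandings in places. I would appreciate it if you pointed out anything that looks off.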

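Goal

Place many apples (represented as dots) inside a box.
Place an AI-like bug in the box.
The bug can choose to move up, down, left, or right, or to stay where it is.
Eating an apple is the reward.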
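
The action set above could be sketched as follows. This is a minimal sketch, assuming the same encoding the full source uses (0 = stay, 1 to 4 = one step along x or y) and a grid running from 0 to 40 on both axes. The helper name `move` is hypothetical and exists only for illustration.

```python
# Hypothetical helper, for illustration only; not part of the original script.
MAX_X = 40
MAX_Y = 40

def move(x, y, act):
    """Return the new (x, y) after applying action `act`.

    Moves that would leave the box are ignored, so the bug stays in place.
    """
    if act == 1 and x < MAX_X:
        x += 1
    elif act == 2 and x > 0:
        x -= 1
    elif act == 3 and y < MAX_Y:
        y += 1
    elif act == 4 and y > 0:
        y -= 1
    # act == 0 (or a blocked move) leaves the position unchanged
    return x, y
```

For example, `move(0, 0, 2)` leaves the bug at `(0, 0)` because it is already on the left edge.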

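Design overview

The design is basically built on what I made last time.
The process of moving around and collecting rewards is implemented with deep learning.
A visual display is needed, so I use matplotlib, which looks easier to use than wxPython. (It is really meant for drawing graphs, though.)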
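
In the full source, the value the network learns for each candidate position is the immediate reward at that position plus GAMMA times the largest network output for that position. A minimal sketch of that calculation in plain Python, assuming the network outputs are passed in as plain lists instead of coming from `sess.run`; the name `q_targets` is hypothetical:

```python
GAMMA = 0.9  # discount factor, same value as in the full source

def q_targets(immediate_rewards, future_q_rows, gamma=GAMMA):
    """Combine each immediate reward with the discounted best future value.

    immediate_rewards: one reward per candidate position.
    future_q_rows: one list of network outputs per candidate position.
    """
    targets = []
    for reward, q_row in zip(immediate_rewards, future_q_rows):
        targets.append(reward + gamma * max(q_row))
    return targets

# Two candidate positions: the first has an apple, the second does not.
targets = q_targets([1.0, 0.0], [[0.5, 0.2], [0.1, 0.3]])
```

With GAMMA below 1, a reward that is several steps away contributes less than one that is adjacent.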

Environment

Ubuntu 14.04
TensorFlow 0.7
GCE instance with 8 CPUs
Python 2.7

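Implementation

Note: I coded this in a rush and kept changing things while experimenting, so the code is not very clean. Please bear with it.
The full source is listed near the bottom.

Graph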

def inference(x_ph):

    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.nn.relu(tf.matmul(x_ph, weights) + biases)

    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)

    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases

    return y

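There are two hidden layers, with 1000 and 500 units. I picked these numbers more or less arbitrarily.
I am not sure whether relu is the right activation function here...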
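
For a rough sense of the network's size, the weights and biases can be counted from the layer widths. This is only arithmetic on the constants defined in the full source (361 inputs, 1000 and 500 hidden units, 5 outputs); `count_parameters` is a hypothetical helper, not part of the original script.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between the two layers
        total += n_out         # one bias per unit in the next layer
    return total

print(count_parameters([361, 1000, 500, 5]))  # 865005
```

Most of the parameters sit in the first two weight matrices, which may be part of why training feels heavy on CPUs.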

Input vector

    def getInputField(self, x, y):
        side = int(math.sqrt(NUM_IMPUT))
        rad = int(math.sqrt(NUM_IMPUT) / 2)
        field = [0.] * NUM_IMPUT
        for apple_x, apple_y in zip(apple_xs, apple_ys):
            if apple_x >= x - rad and apple_x <= x + rad \
            and apple_y >= y - rad and apple_y <= y + rad:
                idx = side * (apple_y - y + rad) + (apple_x - x + rad)
                field[idx] = 1.

        return field

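The bug looks at the cells within a radius of 9 around itself, and every cell that contains an apple is set to 1. This is used as the input. In other words, its field of view is a 19*19 square that includes its own cell.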
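
To make the mapping from grid coordinates to the 361-element vector concrete, here is a minimal sketch of the index calculation used in `getInputField`. The function name `view_index` and the constants `SIDE` and `RAD` are hypothetical names introduced for illustration.

```python
SIDE = 19        # width of the square field of view
RAD = SIDE // 2  # 9 cells on each side of the bug

def view_index(bug_x, bug_y, apple_x, apple_y):
    """Return the input-vector index for an apple, or None if it is out of view."""
    dx = apple_x - bug_x
    dy = apple_y - bug_y
    if abs(dx) > RAD or abs(dy) > RAD:
        return None
    # Row-major layout: rows follow y, columns follow x.
    return SIDE * (dy + RAD) + (dx + RAD)
```

An apple on the bug's own cell lands at index 180, the center of the vector, while apples at the two opposite corners of the view land at indices 0 and 360.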

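...Ah, I am running out of time.
I will write up the details some other time if I feel like it.
The code and a video of the behavior after training are below.
If anything looks off, wrong, or could be done better, I would appreciate a comment.

Results

Source code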

# -*- coding: UTF-8 -*-

# import
import math
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import random
import tensorflow as tf
import numpy as np

# definition
NUM_IMPUT = 361
NUM_HIDDEN1 = 1000
NUM_HIDDEN2 = 500
NUM_OUTPUT = 5
LEARNING_RATE = 0.1
REPEAT_TIMES = 1000
LOG_DIR = "tf_log"
GAMMA = 0.9
stddev = 0.01
RANDOM_FACTOR = 0.1
BATCH = 300
FRAMES = 600
MAX_X = 40
MAX_Y = 40
X_LIMIT = [0, MAX_X]
Y_LIMIT = [0, MAX_Y]
apple_xs = []
apple_ys = []
NUM_CELL = 1
NUM_APPLE = 200

def inference(x_ph):

    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([NUM_IMPUT, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1], dtype=tf.float32), name='biases')
        hidden1 = tf.nn.relu(tf.matmul(x_ph, weights) + biases)

    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2], dtype=tf.float32), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)

    with tf.name_scope('output'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, NUM_OUTPUT], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_OUTPUT], dtype=tf.float32), name='biases')
        y = tf.matmul(hidden2, weights) + biases

    return y

def loss(y, y_ph):
    return tf.reduce_mean(tf.nn.l2_loss((y - y_ph)))

def optimize(loss):
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    train_step = optimizer.minimize(loss)
    return train_step

class Cell:
    def __init__(self, training=True):
        self.lifetime = random.randint(1, 100)
        self.x = random.randint(0, MAX_X)
        self.y = random.randint(0, MAX_Y)
        self.input_history = []
        self.reward_history = []
        self.training = training

    def getInputField(self, x, y):
        side = int(math.sqrt(NUM_IMPUT))
        rad = int(math.sqrt(NUM_IMPUT) / 2)
        field = [0.] * NUM_IMPUT
        for apple_x, apple_y in zip(apple_xs, apple_ys):
            if apple_x >= x - rad and apple_x <= x + rad \
            and apple_y >= y - rad and apple_y <= y + rad:
                idx = side * (apple_y - y + rad) + (apple_x - x + rad)
                field[idx] = 1.

        return field

    def getNextPositionReward(self, x, y):
        for apple_x, apple_y in zip(apple_xs, apple_ys):
            if apple_x >= x-1 and apple_x <= x+1 and \
                apple_y >= y-1 and apple_y <= y+1:
                return 1.

        return 0.

    def moveNextPosition(self, next_position_rewards):

        if random.random() < RANDOM_FACTOR and self.training:
            act = random.randint(0, 4)
        else:
            act = np.argmax(next_position_rewards)

        if act == 1 and self.x < MAX_X :
            self.x += 1
        elif act == 2 and self.x > 0:
            self.x -= 1
        elif act == 3 and self.y < MAX_Y:
            self.y += 1
        elif act == 4 and self.y > 0:
            self.y -= 1

        for i in range(len(apple_xs)):
            if apple_xs[i] >= self.x-1 and apple_xs[i] <= self.x+1 and \
                apple_ys[i] >= self.y-1 and apple_ys[i] <= self.y+1:
                apple_xs.pop(i)
                apple_ys.pop(i)
                break

    def action(self):
        input0 = self.getInputField(self.x, self.y)
        input1 = self.getInputField(self.x+1, self.y)
        input2 = self.getInputField(self.x-1, self.y)
        input3 = self.getInputField(self.x, self.y+1)
        input4 = self.getInputField(self.x, self.y-1)

        next_position_rewards = []
        next_position_rewards.append(self.getNextPositionReward(self.x, self.y))
        next_position_rewards.append(self.getNextPositionReward(self.x+1, self.y))
        next_position_rewards.append(self.getNextPositionReward(self.x-1, self.y))
        next_position_rewards.append(self.getNextPositionReward(self.x, self.y+1))
        next_position_rewards.append(self.getNextPositionReward(self.x, self.y-1))

        future_rewards_array = sess.run(y, feed_dict={x_ph: [input0, input1, input2, input3, input4]})
        for i in range(5):
            next_position_rewards[i] = next_position_rewards[i] + GAMMA * np.max(future_rewards_array[i])

        self.input_history.append(input0)
        self.reward_history.append(next_position_rewards)

        self.moveNextPosition(next_position_rewards)


class World:
    def __init__(self, training=True):

        self.cells = []
        for i in range(NUM_CELL):
            cell = Cell(training)
            self.cells.append(cell)

        while(len(apple_xs) != 0):
            apple_xs.pop()
            apple_ys.pop()

        for i in range(NUM_APPLE):
            apple_xs.append(random.randint(0, MAX_X))
            apple_ys.append(random.randint(0, MAX_Y))

    def _update_plot(self, i, fig, im):
        xs = []
        ys = []
        for cell in self.cells:
            cell.action()
            xs.append(cell.x)
            ys.append(cell.y)

        self.red.set_data(xs, ys)
        self.yellow.set_data(apple_xs, apple_ys)

    def showAnimation(self, filename=None):

        self.fig =  plt.figure()
        ax = self.fig.add_subplot(1,1,1)

        ax.set_xlim(X_LIMIT)
        ax.set_ylim(Y_LIMIT)

        #addition
        self.red, = ax.plot([], [], 'ro', lw=2)
        self.yellow, = ax.plot([], [], 'yo', lw=2)

        self.im = []

        ani = animation.FuncAnimation(self.fig, self._update_plot, fargs = (self.fig, self.im),
                                          frames = FRAMES, interval = 10, repeat = False)

#         plt.show()

        if filename is not None:
            ani.save(filename, writer="mencoder")

    def training(self):
        for cell in self.cells:
            cell.action()

if __name__ == "__main__":

    x_ph = tf.placeholder(tf.float32, [None, NUM_IMPUT])
    y_ph = tf.placeholder(tf.float32, [None, NUM_OUTPUT])

    y = inference(x_ph)
    loss = loss(y, y_ph)
    tf.scalar_summary("Loss", loss)
    train_step = optimize(loss)

    sess = tf.Session()
    summary_op = tf.merge_all_summaries()
    init = tf.initialize_all_variables()
    sess.run(init)
    summary_writer = tf.train.SummaryWriter(LOG_DIR, graph_def=sess.graph_def)

    for i in range(REPEAT_TIMES):

        world = World()
        for j in range(BATCH):
            for cell in world.cells:
                cell.action()

        for cell in world.cells:
            sess.run(train_step, feed_dict={x_ph: cell.input_history, y_ph: cell.reward_history})
            summary_str = sess.run(summary_op, feed_dict={x_ph: cell.input_history, y_ph: cell.reward_history})
            summary_writer.add_summary(summary_str, i)
            ce = sess.run(loss, feed_dict={x_ph: cell.input_history, y_ph: cell.reward_history})
            # This value is the L2 loss defined in loss(), not a cross entropy.
            print "Loss: " + str(ce)

        print "Count: " + str(i)

    world = World()
    world.showAnimation("ani.mp4")

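As you can probably tell from reading the code, you can also run two or more AI bugs by raising NUM_CELL.

Impressions

Processing is heavier than I expected. I was using a fairly good instance, so I thought the computation would be faster, but it takes a surprisingly long time. As a result, the batch length and the number of repetitions end up fairly constrained.
The bug does seem to chase apples, but once the apples around it are gone, it moves as if it does not know what to do. I set the batch fairly short, so perhaps it could not learn behavior with a long-term view. The remaining apples should still be within its field of view, yet it only seems to react when they are close.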
