Learning About Reinforcement Learning (Part 22)

Yesterday I implemented an AI that uses the Sarsa method.

Today I'll implement an AI that uses Q-learning.

QLearningCom

Let's get straight to implementing the Q-learning AI.

#!/usr/bin/env ruby

require './tic_tac_toe'
require './state'
require './value'

module TicTacToe
  class QLearningCom
    @@epsilon = 0.1

    def initialize(mark, value, learning=true)
      @mark = mark
      @value = value
      @learning = learning
      @previous_reward = nil
      @previous_after_state = nil
    end

    attr_reader :mark
    attr_accessor :learning

    def select_index(state)
      max_action = @value.get_max_action(state, @mark)
      selected_action = max_action

      if @learning
        # Learn the value of the previous action under the target (estimation) policy, which is greedy
        after_state = state.set(max_action, @mark)
        if @previous_reward != nil && @previous_after_state != nil
          @value.update(@previous_after_state, @previous_reward, after_state)
        end

        # Choose the actual move with the behavior policy (ε-greedy)
        if Random.rand < @@epsilon
          selected_action = state.valid_actions.sample
        end
        @previous_after_state = state.set(selected_action, @mark)
      end

      return selected_action
    end

    def learn(reward, finished=false)
      if @learning
        if finished
          # In a terminal state this is the only chance to learn,
          # so do the update here
          @value.update(@previous_after_state, reward, nil)
          @previous_reward = nil
          @previous_after_state = nil
        else
          @previous_reward = reward
        end
      end
    end
  end
end

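QLearningCom is almost identical to the Sarsa AI. The one difference is in how the value is learned.
Sarsa selects an action in the next state and learns from the next after-state that action actually produces. Q-learning instead learns from the next after-state that would be reached by following the greedy policy (its target, or estimation, policy) in the next state, regardless of which move the ε-greedy behavior policy then actually plays.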
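To make that difference concrete, here is a minimal sketch of a one-step update over after-state values. Everything in it is an assumption for illustration: ToyValue is a hypothetical stand-in and not the Value class from this series, ALPHA and GAMMA are made-up constants, and the after-states are just symbols. The only point is that the two methods call the same update and differ in which next after-state they pass in.

```ruby
# Hypothetical stand-in for a value table over after-states.
# This is NOT the Value class used in this series.
class ToyValue
  ALPHA = 0.5 # learning rate (assumed for this sketch)
  GAMMA = 1.0 # discount factor (assumed for this sketch)

  def initialize
    @table = Hash.new(0.0) # unseen after-states start at 0.0
  end

  def get(after_state)
    @table[after_state]
  end

  def set(after_state, value)
    @table[after_state] = value
  end

  # V(prev) <- V(prev) + ALPHA * (reward + GAMMA * V(next) - V(prev))
  # A nil next_after_state marks a terminal transition, so V(next) is 0.
  def update(previous_after_state, reward, next_after_state)
    next_value = next_after_state.nil? ? 0.0 : get(next_after_state)
    old = get(previous_after_state)
    @table[previous_after_state] = old + ALPHA * (reward + GAMMA * next_value - old)
  end
end

# Builds a table in which the greedy move leads to a better after-state
# than the exploratory move.
def seeded_value
  value = ToyValue.new
  value.set(:greedy_after_state, 0.8)
  value.set(:explored_after_state, 0.2)
  value
end

# Suppose the ε-greedy behavior policy happened to explore on this turn.
greedy_after_state   = :greedy_after_state
selected_after_state = :explored_after_state

# Sarsa learns from the after-state of the move that was actually selected.
sarsa = seeded_value
sarsa.update(:previous_after_state, 0.0, selected_after_state)

# Q-learning learns from the after-state of the greedy move,
# even though a different move was played.
q_learning = seeded_value
q_learning.update(:previous_after_state, 0.0, greedy_after_state)

puts "Sarsa:      #{sarsa.get(:previous_after_state)}"
puts "Q-learning: #{q_learning.get(:previous_after_state)}"
```

When the behavior policy picks the greedy move, both updates are identical. They only diverge on exploratory turns, which with an epsilon of 0.1 is at most about one move in ten.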

Example run

Now that both the Sarsa AI and the Q-learning AI are implemented, let's write the following code and actually run them.

#!/usr/bin/env ruby

require './tic_tac_toe'
require './state'
require './game'
require './value'
require './human_player'
require './sarsa_com'
require './q_learning_com'

value = TicTacToe::Value.new
puts "学習方法を選択 [1,2]"
puts "1. Sarsa"
puts "2. Q学習"
selected_com = $stdin.gets.chomp.to_i
case selected_com
when 1
  com_1 = TicTacToe::SarsaCom.new(TicTacToe::MARU, value)
  com_2 = TicTacToe::SarsaCom.new(TicTacToe::BATSU, value)
when 2
  com_1 = TicTacToe::QLearningCom.new(TicTacToe::MARU, value)
  com_2 = TicTacToe::QLearningCom.new(TicTacToe::BATSU, value)
else
  # Unrecognized input: there is no AI to train, so quit
  exit 1
end

puts "学習回数を入力:"
count = $stdin.gets.chomp.to_i
count.times do
  game = TicTacToe::Game.new(com_1, com_2)
  game.start
end
com_1.learning = false
com_2.learning = false

loop do
  puts "選択 [1,2,3,4]"
  puts "1. 人間 vs COM"
  puts "2. COM vs 人間"
  puts "3. COM vs COM"
  puts "4. 終了"
  selected = $stdin.gets.chomp.to_i
  case selected
  when 1
    game = TicTacToe::Game.new(TicTacToe::HumanPlayer.new(TicTacToe::MARU), com_2)
  when 2
    game = TicTacToe::Game.new(com_1, TicTacToe::HumanPlayer.new(TicTacToe::BATSU))
  when 3
    game = TicTacToe::Game.new(com_1, com_2)
  when 4
    break
  else
    # Unrecognized input: game would be nil, so show the menu again
    next
  end
  game.start(true)
end

Running it looks like this.

学習方法を選択 [1,2]
1. Sarsa
2. Q学習
1
学習回数を入力:
10000
選択 [1,2,3,4]
1. 人間 vs COM
2. COM vs 人間
3. COM vs COM
4. 終了
1

.|.|.
-+-+-
.|.|.
-+-+-
.|.|.

<player: o>
select index [1,2,3,4,5,6,7,8,9]
5

.|.|.
-+-+-
.|o|.
-+-+-
.|.|.


x|.|.
-+-+-
.|o|.
-+-+-
.|.|.


x|.|.
-+-+-
.|o|.
-+-+-
.|.|.

<player: o>
select index [2,3,4,6,7,8,9]
2

x|o|.
-+-+-
.|o|.
-+-+-
.|.|.


x|o|.
-+-+-
.|o|.
-+-+-
.|x|.


x|o|.
-+-+-
.|o|.
-+-+-
.|x|.

<player: o>
select index [3,4,6,7,9]
7

x|o|.
-+-+-
.|o|.
-+-+-
o|x|.


x|o|x
-+-+-
.|o|.
-+-+-
o|x|.


x|o|x
-+-+-
.|o|.
-+-+-
o|x|.

<player: o>
select index [4,6,9]
6

x|o|x
-+-+-
.|o|o
-+-+-
o|x|.


x|o|x
-+-+-
x|o|o
-+-+-
o|x|.


x|o|x
-+-+-
x|o|o
-+-+-
o|x|.

<player: o>
select index [9]
9

x|o|x
-+-+-
x|o|o
-+-+-
o|x|o

draw.
選択 [1,2,3,4]
1. 人間 vs COM
2. COM vs 人間
3. COM vs COM
4. 終了
4

From a handful of tries, both Sarsa and Q-learning seem to stop making odd moves once they have been trained for about 10,000 games.

That's all for today!