Yesterday I explained on-policy and off-policy Monte Carlo control.
Today, let's actually write programs that use these algorithms.
Racetrack
I'll implement the racetrack problem, which appears as an exercise in the book, with both on-policy and off-policy Monte Carlo control.
(Though I've added a few extra rules to make it easier to program.)
The problem is as follows:
- Drive around a grid-based course and reach the goal as fast as possible without going off the course.
- Start and goal
  - The car starts from one of the cells on the start line.
  - The episode ends when the car reaches one of the cells on the goal line.
- State
  - A state is a combination of position and velocity.
- Velocity
  - The velocity is discrete and is represented by a horizontal and a vertical component.
  - Each velocity component is one of -4, -3, -2, -1, 0, 1, 2, 3, 4.
  - The two components cannot both be 0 at the same time.
- Actions
  - An action changes each of the horizontal and vertical velocity components by +1, -1, or 0.
- Rewards
  - The reward is -1 per step, and -5 for trying to go off the course.
- Additional rules
  - Movement is treated as a warp (the car jumps straight to the destination cell).
    - Computing whether the straight line between the source and the destination cells leaves the course is hard.
    - Judging whether the goal line has been crossed is also hard.
  - If the car would go off the course, it stays where it is.
    - How going off the course should be handled is not defined.
Implementation in Ruby
Course definitions
First, the constants used to describe a course, and the courses defined with them.
#====================
# racetrack.rb
#====================

module Racetrack
  # A course is described with the following characters
  START = 'S'
  IN = '.'
  OUT = 'o'
  GOAL = 'G'

  # Absolute value of the maximum speed
  MAX_SPEED = 4
end

# Course 1
COURSE1 = <<__END_OF_COURSE1__
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooo.............Goooo
oooooo..............Goooo
oooooo..............Goooo
ooooo...............Goooo
oooo................Goooo
oooo................Goooo
oooo..........ooooooooooo
oooo.........oooooooooooo
oooo.........oooooooooooo
oooo.........oooooooooooo
oooo.........oooooooooooo
oooo.........oooooooooooo
oooo.........oooooooooooo
oooo.........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
ooooo........oooooooooooo
oooooo.......oooooooooooo
oooooo.......oooooooooooo
oooooo.......oooooooooooo
oooooo.......oooooooooooo
oooooo.......oooooooooooo
oooooo.......oooooooooooo
oooooo.......oooooooooooo
ooooooo......oooooooooooo
ooooooo......oooooooooooo
oooooooSSSSSSoooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
__END_OF_COURSE1__

# Course 2
COURSE2 = <<__END_OF_COURSE2__
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooo...............Goooo
ooooooooooooooooo..................Goooo
oooooooooooooooo...................Goooo
ooooooooooooooo....................Goooo
ooooooooooooooo....................Goooo
ooooooooooooooo....................Goooo
ooooooooooooooo....................Goooo
oooooooooooooooo...................Goooo
ooooooooooooooooo..................Goooo
oooooooooooooooooo................oooooo
oooooooooooooooooo.............ooooooooo
oooooooooooooooooo............oooooooooo
oooooooooooooooooo..........oooooooooooo
oooooooooooooooooo.........ooooooooooooo
ooooooooooooooooo..........ooooooooooooo
oooooooooooooooo...........ooooooooooooo
ooooooooooooooo............ooooooooooooo
oooooooooooooo.............ooooooooooooo
ooooooooooooo..............ooooooooooooo
oooooooooooo...............ooooooooooooo
ooooooooooo................ooooooooooooo
oooooooooo.................ooooooooooooo
ooooooooo..................ooooooooooooo
oooooooo...................ooooooooooooo
ooooooo....................ooooooooooooo
oooooo.....................ooooooooooooo
ooooo......................ooooooooooooo
oooo.......................ooooooooooooo
oooo.......................ooooooooooooo
ooooSSSSSSSSSSSSSSSSSSSSSSSooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooo
__END_OF_COURSE2__
The Course class
Next, the class that represents a course.
Note that states and actions are handled encoded as strings.
#====================
# course.rb
#====================

require './racetrack'

module Racetrack
  class Course
    # Build a course from the string that describes it
    def initialize(course)
      @course_info = Array.new
      @starts = Array.new
      course.each_line.with_index do |line, y|
        line_info = Array.new
        line.each_char.with_index do |c, x|
          line_info.push c
          if c == START
            @starts.push to_state(x, y, 0, 0)
          end
        end
        line_info.freeze
        @course_info.push line_info
      end
      @course_info.freeze
      @starts.freeze
    end

    attr_reader :starts

    def in?(state)
      to_course_info(state) == IN
    end

    def out?(state)
      to_course_info(state) == OUT
    end

    def start?(state)
      to_course_info(state) == START
    end

    def goal?(state)
      to_course_info(state) == GOAL
    end

    # Return the set of possible actions in a state
    def get_valid_actions(state)
      vx, vy = to_speed(state)
      valid_actions = Array.new
      (-1..1).each do |ax|
        (-1..1).each do |ay|
          new_vx = vx + ax
          new_vy = vy + ay
          if (new_vx.abs <= MAX_SPEED &&
              new_vy.abs <= MAX_SPEED &&
              new_vx.abs + new_vy.abs > 0)
            valid_actions.push to_action(ax, ay)
          end
        end
      end
      return valid_actions
    end

    # Take an action from a state and return the next state and the reward
    def step(state, action)
      x, y, vx, vy = to_position_speed(state)
      ax, ay = to_ax_ay(action)
      vx += ax
      vy += ay
      new_x = x + vx
      new_y = y + vy
      if course_info(new_x, new_y) == OUT
        next_state = to_state(x, y, vx, vy)
        reward = -5
      else
        next_state = to_state(new_x, new_y, vx, vy)
        reward = -1
      end
      return next_state, reward
    end

    private

    def to_state(x, y, vx, vy)
      [x, y, vx, vy].map(&:to_s).join(',')
    end

    def to_position_speed(state)
      state.split(',').map(&:to_i)
    end

    def to_position(state)
      to_position_speed(state).slice(0, 2)
    end

    def to_speed(state)
      to_position_speed(state).slice(2, 2)
    end

    def to_course_info(state)
      x, y = to_position(state)
      course_info(x, y)
    end

    def course_info(x, y)
      @course_info[y][x]
    end

    def to_action(ax, ay)
      [ax, ay].map(&:to_s).join(',')
    end

    def to_ax_ay(action)
      action.split(',').map(&:to_i)
    end
  end
end
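To make the string encoding concrete, here is a minimal usage sketch (the concrete return values assume COURSE1 as defined above):

# (usage sketch, assuming racetrack.rb and course.rb above are in place)
require './course'

course = Racetrack::Course.new(COURSE1)
state  = course.starts.first          # => "7,35,0,0"  (x, y, vx, vy encoded as one string)
course.get_valid_actions(state)       # => 8 actions such as "0,-1"; "0,0" is excluded because the speed would stay (0, 0)
course.step(state, '0,-1')            # => ["7,34,0,-1", -1]   accelerate upwards (y decreases)
course.step(state, '-1,0')            # => ["7,35,-1,0", -5]   would leave the course, so the position stays put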
Action values
Next, the action values.
So that it can also be used with off-policy Monte Carlo control, values can be updated with a weight.
#====================
# value.rb
#====================

require './racetrack'
require './course'

module Racetrack
  class Value
    # So that this can also be used for off-policy Monte Carlo,
    # keep the numerator (points: the sum of the weighted returns) and
    # the denominator (weights: the sum of the weights) from which
    # the action value is computed.
    def initialize
      @value = Hash.new do |hash, state|
        hash[state] = Hash.new do |hash, action|
          hash[action] = 0.0
        end
      end
      @points = Hash.new do |hash, state|
        hash[state] = Hash.new do |hash, action|
          hash[action] = 0.0
        end
      end
      @weights = Hash.new do |hash, state|
        hash[state] = Hash.new do |hash, action|
          hash[action] = 0.0
        end
      end
    end

    def get(state, action)
      @value[state][action]
    end

    def get_max_action(state)
      hash = @value[state]
      max_value = hash.values.max
      max_action = hash.key(max_value)
      return max_action
    end

    def update(state, action, reward, weight=1.0)
      @points[state][action] += reward * weight
      @weights[state][action] += weight
      @value[state][action] = @points[state][action] / @weights[state][action]
    end
  end
end
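So the action value is kept as a weighted average, Q(s, a) = sum(w_i * G_i) / sum(w_i). A minimal sketch of how that plays out (the returns and the weight below are made-up numbers, just for illustration):

# (sketch) the action value as a weighted average
require './value'

value  = Racetrack::Value.new
state  = "7,35,0,0"
action = "0,-1"
value.update(state, action, -10.0)        # on-policy style: the weight defaults to 1.0
value.update(state, action, -6.0, 2.5)    # off-policy style: pass an importance-sampling weight
value.get(state, action)                  # => (-10.0 * 1.0 + -6.0 * 2.5) / (1.0 + 2.5) = -7.142857...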
Policy
The policy is a simple deterministic one: it returns a single action for a given state.
When a soft policy is needed, it is up to the caller to soften it.
#====================
# policy.rb
#====================

require './racetrack'
require './course'

module Racetrack
  class Policy
    # The initial policy picks one of the possible actions at random
    def initialize(course)
      @course = course
      @policy = Hash.new do |hash, state|
        valid_actions = @course.get_valid_actions(state)
        hash[state] = valid_actions.sample
      end
    end

    def get(state)
      @policy[state]
    end

    def set(state, action)
      @policy[state] = action
    end
  end
end
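The policy table is filled lazily: the first lookup of an unseen state picks a random valid action and remembers it. A minimal sketch (the action returned by the first get is random):

# (usage sketch, assuming the files above)
require './policy'

course = Racetrack::Course.new(COURSE1)
policy = Racetrack::Policy.new(course)
policy.get("7,35,0,0")          # first lookup: some random valid action; repeated calls return the same one
policy.set("7,35,0,0", "0,-1")
policy.get("7,35,0,0")          # => "0,-1"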
On-policy Monte Carlo control
Now for on-policy Monte Carlo control.
#!/usr/bin/env ruby

#====================
# on_policy_monte_carlo_method.rb
#====================

require './racetrack'
require './course'
require './value'
require './policy'

module Racetrack
  class OnPolicyMonteCarloMethod
    def initialize(course, value, policy, epsilon=0.1)
      @course = course
      @value = value
      @policy = policy
      @epsilon = epsilon
    end

    # Generate an episode and learn with the on-policy Monte Carlo method.
    # If start is not given, a start cell is chosen at random.
    # If learning is false, no learning is done.
    def simulate(start=nil, learning=true, verbose=false)
      if start == nil
        starts = @course.starts
        start = starts.sample
      end

      states = Array.new
      actions = Array.new
      rewards = Array.new

      # Generate an episode
      state = start
      states.push state
      loop do
        valid_actions = @course.get_valid_actions(state)
        select_action = @policy.get(state)
        if learning
          select = Random.rand
          valid_actions.each_with_index do |action, i|
            if select < ((@epsilon / valid_actions.size) * (i + 1))
              select_action = action
              break
            end
          end
        end
        actions.push select_action

        next_state, reward = @course.step(state, select_action)
        states.push next_state
        rewards.push reward

        puts sprintf("position: (%2s, %2s), speed: (%2s, %2s), action: (%2s, %2s), reward: %2d",
                     *(state.split(",")), *(select_action.split(",")), reward) if verbose

        if @course.goal?(next_state)
          puts "total rewards: #{rewards.inject(:+)}" if verbose
          break
        else
          state = next_state
        end
      end

      if learning
        # Update the values
        updated = Array.new
        states.pop  # the last state is the terminal one, so discard it
        states.zip(actions).each_with_index do |(state, action), i|
          unless updated.include?([state, action])
            reward_sum = rewards[i, rewards.size - i].inject(:+)
            @value.update(state, action, reward_sum)
            updated.push([state, action])
          end
        end

        # Update the policy
        updated.each do |state, action|
          max_action = @value.get_max_action(state)
          @policy.set(state, max_action)
        end
      end
    end
  end
end

if __FILE__ == $PROGRAM_NAME
  # Course 1
  puts "------------------------------"
  puts "コース1"
  puts "------------------------------"
  course = Racetrack::Course.new(COURSE1)
  value = Racetrack::Value.new
  policy = Racetrack::Policy.new(course)
  opmc_method = Racetrack::OnPolicyMonteCarloMethod.new(course, value, policy, 0.4)
  (1..10).each do |i|
    10000.times do
      opmc_method.simulate
    end
    puts "--------------------"
    puts "[#{i * 10000}]"
    puts "--------------------"
    course.starts.each do |start|
      puts "----------"
      puts "<start: #{start}>"
      opmc_method.simulate(start, false, true)
      puts "----------"
    end
  end

  # Course 2
  puts "------------------------------"
  puts "コース2"
  puts "------------------------------"
  course = Racetrack::Course.new(COURSE2)
  value = Racetrack::Value.new
  policy = Racetrack::Policy.new(course)
  opmc_method = Racetrack::OnPolicyMonteCarloMethod.new(course, value, policy, 0.4)
  (1..10).each do |i|
    10000.times do
      opmc_method.simulate
    end
    puts "--------------------"
    puts "[#{i * 10000}]"
    puts "--------------------"
    course.starts.each do |start|
      puts "----------"
      puts "<start: #{start}>"
      opmc_method.simulate(start, false, true)
      puts "----------"
    end
  end
end
Nothing complicated is going on here, so it probably doesn't need much explanation.
One thing to note: when not learning, the agent follows the deterministic policy rather than the soft one.
(Otherwise we couldn't see which action is currently considered best.)
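One detail worth spelling out is the epsilon-soft selection inside simulate: each of the n valid actions owns a slice of width epsilon/n of the unit interval, so a uniformly random valid action overrides the greedy one with total probability epsilon, and the greedy action keeps probability (1 - epsilon) + epsilon/n. A standalone sketch of the same rule, with concrete values filled in only for illustration:

# (sketch of the epsilon-soft selection rule used in simulate)
epsilon       = 0.4
valid_actions = ["-1,-1", "-1,0", "-1,1", "0,-1", "0,1", "1,-1", "1,0", "1,1"]  # the 8 valid actions at a start cell
greedy_action = "0,-1"                       # whatever the current policy returns
select_action = greedy_action
select = Random.rand                         # uniform in [0, 1)
valid_actions.each_with_index do |action, i|
  # action i "owns" the slice [i * epsilon/n, (i + 1) * epsilon/n)
  if select < (epsilon / valid_actions.size) * (i + 1)
    select_action = action
    break
  end
end
# => each valid action is forced with probability epsilon/n (epsilon in total);
#    otherwise (probability 1 - epsilon) the greedy action is kept.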
Running this gives output like the following.
------------------------------ コース1 ------------------------------ -------------------- [10000] -------------------- ---------- <start: 7,35,0,0> position: ( 7, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: ( 7, 34), speed: ( 0, -1), action: ( 0, 0), reward: -1 position: ( 7, 33), speed: ( 0, -1), action: (-1, 0), reward: -1 position: ( 6, 32), speed: (-1, -1), action: ( 1, -1), reward: -1 position: ( 6, 30), speed: ( 0, -2), action: ( 1, -1), reward: -1 position: ( 7, 27), speed: ( 1, -3), action: (-1, -1), reward: -1 position: ( 7, 23), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 8, 19), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 9, 15), speed: ( 1, -4), action: ( 1, 1), reward: -1 position: (11, 12), speed: ( 2, -3), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (17, 7), speed: ( 3, -2), action: ( 0, -1), reward: -1 total rewards: -12 ---------- ---------- <start: 8,35,0,0> position: ( 8, 35), speed: ( 0, 0), action: ( 1, 0), reward: -1 position: ( 9, 35), speed: ( 1, 0), action: (-1, -1), reward: -1 position: ( 9, 34), speed: ( 0, -1), action: ( 0, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: ( 9, 29), speed: ( 0, -3), action: ( 0, 0), reward: -1 position: ( 9, 26), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: (10, 22), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: (11, 18), speed: ( 1, -4), action: ( 1, 1), reward: -5 position: (11, 18), speed: ( 2, -3), action: (-1, 0), reward: -1 position: (12, 15), speed: ( 1, -3), action: (-1, 0), reward: -1 position: (12, 12), speed: ( 0, -3), action: ( 0, 0), reward: -1 position: (12, 9), speed: ( 0, -3), action: ( 1, 0), reward: -1 position: (13, 6), speed: ( 1, -3), action: ( 1, 1), reward: -1 position: (15, 4), speed: ( 2, -2), action: ( 0, 1), reward: -5 position: (15, 4), speed: ( 2, -1), action: ( 1, 1), reward: -1 position: (18, 4), speed: ( 3, 0), action: (-1, 1), reward: -1 total rewards: -24 ---------- ---------- <start: 9,35,0,0> position: ( 9, 35), speed: ( 0, 0), action: (-1, -1), reward: -1 position: ( 8, 34), speed: (-1, -1), action: ( 1, -1), reward: -1 position: ( 8, 32), speed: ( 0, -2), action: (-1, -1), reward: -1 position: ( 7, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 7, 25), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 8, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 9, 17), speed: ( 1, -4), action: ( 0, 1), reward: -1 position: (10, 14), speed: ( 1, -3), action: ( 1, 1), reward: -1 position: (12, 12), speed: ( 2, -2), action: ( 0, -1), reward: -1 position: (14, 9), speed: ( 2, -3), action: ( 1, 1), reward: -1 position: (17, 7), speed: ( 3, -2), action: ( 0, -1), reward: -1 total rewards: -11 ---------- ---------- <start: 10,35,0,0> position: (10, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: (10, 34), speed: ( 0, -1), action: ( 0, -1), reward: -1 position: (10, 32), speed: ( 0, -2), action: (-1, -1), reward: -1 position: ( 9, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 9, 25), speed: ( 0, -4), action: (-1, 0), reward: -1 position: ( 8, 21), speed: (-1, -4), action: ( 1, 0), reward: -1 position: ( 8, 17), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 9, 13), speed: ( 1, -4), action: ( 1, 1), reward: -1 position: (11, 10), speed: ( 2, -3), action: ( 0, 0), reward: -1 position: (13, 7), speed: ( 2, -3), action: ( 1, 1), reward: -1 position: (16, 5), speed: ( 3, -2), 
action: ( 1, 1), reward: -1 total rewards: -11 ---------- ---------- <start: 11,35,0,0> position: (11, 35), speed: ( 0, 0), action: ( 1, -1), reward: -1 position: (12, 34), speed: ( 1, -1), action: (-1, 0), reward: -1 position: (12, 33), speed: ( 0, -1), action: ( 0, 0), reward: -1 position: (12, 32), speed: ( 0, -1), action: ( 1, -1), reward: -5 position: (12, 32), speed: ( 1, -2), action: (-1, 0), reward: -1 position: (12, 30), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: (12, 27), speed: ( 0, -3), action: ( 0, 0), reward: -1 position: (12, 24), speed: ( 0, -3), action: (-1, -1), reward: -1 position: (11, 20), speed: (-1, -4), action: ( 1, 0), reward: -1 position: (11, 16), speed: ( 0, -4), action: ( 0, 0), reward: -1 position: (11, 12), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: (12, 8), speed: ( 1, -4), action: ( 1, 1), reward: -1 position: (14, 5), speed: ( 2, -3), action: ( 1, 1), reward: -5 position: (14, 5), speed: ( 3, -2), action: ( 1, 1), reward: -1 position: (18, 4), speed: ( 4, -1), action: (-1, 1), reward: -5 position: (18, 4), speed: ( 3, 0), action: (-1, 1), reward: -1 total rewards: -28 ---------- ---------- <start: 12,35,0,0> position: (12, 35), speed: ( 0, 0), action: (-1, -1), reward: -1 position: (11, 34), speed: (-1, -1), action: ( 0, -1), reward: -1 position: (10, 32), speed: (-1, -2), action: ( 0, -1), reward: -1 position: ( 9, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 9, 25), speed: ( 0, -4), action: (-1, 0), reward: -1 position: ( 8, 21), speed: (-1, -4), action: ( 1, 0), reward: -1 position: ( 8, 17), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 9, 13), speed: ( 1, -4), action: ( 1, 1), reward: -1 position: (11, 10), speed: ( 2, -3), action: ( 0, 0), reward: -1 position: (13, 7), speed: ( 2, -3), action: ( 1, 1), reward: -1 position: (16, 5), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -11 ---------- 〜省略〜 -------------------- [100000] -------------------- ---------- <start: 7,35,0,0> position: ( 7, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: ( 7, 34), speed: ( 0, -1), action: ( 0, -1), reward: -1 position: ( 7, 32), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: ( 7, 29), speed: ( 0, -3), action: ( 0, -1), reward: -1 position: ( 7, 25), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 8, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 9, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (11, 13), speed: ( 2, -4), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -4), action: ( 0, 1), reward: -1 position: (17, 6), speed: ( 3, -3), action: ( 0, 1), reward: -1 total rewards: -10 ---------- ---------- <start: 8,35,0,0> position: ( 8, 35), speed: ( 0, 0), action: (-1, -1), reward: -1 position: ( 7, 34), speed: (-1, -1), action: ( 1, -1), reward: -1 position: ( 7, 32), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: ( 7, 29), speed: ( 0, -3), action: ( 0, -1), reward: -1 position: ( 7, 25), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 8, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 9, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (11, 13), speed: ( 2, -4), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -4), action: ( 0, 1), reward: -1 position: (17, 6), speed: ( 3, -3), action: ( 0, 1), reward: -1 total rewards: -10 ---------- ---------- <start: 9,35,0,0> position: ( 9, 35), speed: ( 0, 0), action: (-1, -1), reward: -1 position: ( 8, 34), speed: (-1, -1), action: ( 1, 
-1), reward: -1 position: ( 8, 32), speed: ( 0, -2), action: (-1, -1), reward: -1 position: ( 7, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 7, 25), speed: ( 0, -4), action: ( 1, 0), reward: -1 position: ( 8, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 9, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (11, 13), speed: ( 2, -4), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -4), action: ( 0, 1), reward: -1 position: (17, 6), speed: ( 3, -3), action: ( 0, 1), reward: -1 total rewards: -10 ---------- ---------- <start: 10,35,0,0> position: (10, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: (10, 34), speed: ( 0, -1), action: ( 0, -1), reward: -1 position: (10, 32), speed: ( 0, -2), action: (-1, -1), reward: -1 position: ( 9, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 9, 25), speed: ( 0, -4), action: (-1, 0), reward: -1 position: ( 8, 21), speed: (-1, -4), action: ( 1, 1), reward: -1 position: ( 8, 18), speed: ( 0, -3), action: ( 1, 0), reward: -1 position: ( 9, 15), speed: ( 1, -3), action: ( 1, 0), reward: -1 position: (11, 12), speed: ( 2, -3), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (17, 7), speed: ( 3, -2), action: ( 0, -1), reward: -1 total rewards: -11 ---------- ---------- <start: 11,35,0,0> position: (11, 35), speed: ( 0, 0), action: (-1, -1), reward: -1 position: (10, 34), speed: (-1, -1), action: ( 1, -1), reward: -1 position: (10, 32), speed: ( 0, -2), action: (-1, -1), reward: -1 position: ( 9, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 9, 25), speed: ( 0, -4), action: (-1, 0), reward: -1 position: ( 8, 21), speed: (-1, -4), action: ( 1, 1), reward: -1 position: ( 8, 18), speed: ( 0, -3), action: ( 1, 0), reward: -1 position: ( 9, 15), speed: ( 1, -3), action: ( 1, 0), reward: -1 position: (11, 12), speed: ( 2, -3), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (17, 7), speed: ( 3, -2), action: ( 0, -1), reward: -1 total rewards: -11 ---------- ---------- <start: 12,35,0,0> position: (12, 35), speed: ( 0, 0), action: (-1, -1), reward: -1 position: (11, 34), speed: (-1, -1), action: ( 0, -1), reward: -1 position: (10, 32), speed: (-1, -2), action: ( 0, -1), reward: -1 position: ( 9, 29), speed: (-1, -3), action: ( 1, -1), reward: -1 position: ( 9, 25), speed: ( 0, -4), action: (-1, 0), reward: -1 position: ( 8, 21), speed: (-1, -4), action: ( 1, 1), reward: -1 position: ( 8, 18), speed: ( 0, -3), action: ( 1, 0), reward: -1 position: ( 9, 15), speed: ( 1, -3), action: ( 1, 0), reward: -1 position: (11, 12), speed: ( 2, -3), action: ( 1, 0), reward: -1 position: (14, 9), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (17, 7), speed: ( 3, -2), action: ( 0, -1), reward: -1 total rewards: -11 ---------- ------------------------------ コース2 ------------------------------ 〜省略〜
You can see that as learning progresses, the policy gets closer to the optimal one.
Off-policy Monte Carlo control
Next, off-policy Monte Carlo control.
As the behaviour policy, an epsilon-soft (epsilon-greedy) policy built around the current greedy estimation policy is used, the same selection scheme as in the on-policy version.
#!/usr/bin/env ruby

#====================
# off_policy_monte_carlo_method.rb
#====================

require './racetrack'
require './course'
require './value'
require './policy'

module Racetrack
  class OffPolicyMonteCarloMethod
    def initialize(course, value, policy, epsilon=0.1)
      @course = course
      @value = value
      @policy = policy
      @epsilon = epsilon
    end

    # Generate an episode and learn with the off-policy Monte Carlo method.
    # If start is not given, a start cell is chosen at random.
    # If learning is false, the soft policy is not used and no learning is done.
    def simulate(start=nil, learning=true, verbose=false)
      if start == nil
        starts = @course.starts
        start = starts.sample
      end

      states = Array.new
      actions = Array.new
      rewards = Array.new

      # Generate an episode
      state = start
      states.push state
      loop do
        valid_actions = @course.get_valid_actions(state)
        select_action = @policy.get(state)
        if learning
          select = Random.rand
          valid_actions.each_with_index do |action, i|
            if select < ((@epsilon / valid_actions.size) * (i + 1))
              select_action = action
              break
            end
          end
        end
        actions.push select_action

        next_state, reward = @course.step(state, select_action)
        states.push next_state
        rewards.push reward

        puts sprintf("position: (%2s, %2s), speed: (%2s, %2s), action: (%2s, %2s), reward: %2d",
                     *(state.split(",")), *(select_action.split(",")), reward) if verbose

        if @course.goal?(next_state)
          puts "total rewards: #{rewards.inject(:+)}" if verbose
          break
        else
          state = next_state
        end
      end

      # Update the values and the policy
      if learning
        # Build the information needed for the updates, going backwards
        update_info = Array.new
        states.pop  # the last state is the terminal one, so discard it
        state = states.pop
        action = actions.pop
        reward_sum = rewards.pop
        weight = 1.0
        loop do
          actions_size = @course.get_valid_actions(state).size
          if action == @policy.get(state)
            update_info.unshift([state, action, reward_sum, weight])
            if states.empty?
              break
            else
              state = states.pop
              action = actions.pop
              reward_sum += rewards.pop
              weight *= 1.0 / ((1 - @epsilon) + (@epsilon / actions_size))
            end
          else
            update_info.unshift([state, action, reward_sum, weight])
            break
          end
        end

        # Update the values
        updated = Array.new
        update_info.each do |state, action, reward_sum, weight|
          unless updated.include?([state, action])
            @value.update(state, action, reward_sum, weight)
            updated.push([state, action])
          end
        end

        # Update the policy
        updated.each do |state, action|
          max_action = @value.get_max_action(state)
          @policy.set(state, max_action)
        end
      end
    end
  end
end

if __FILE__ == $PROGRAM_NAME
  # Course 1
  puts "------------------------------"
  puts "コース1"
  puts "------------------------------"
  course = Racetrack::Course.new(COURSE1)
  value = Racetrack::Value.new
  policy = Racetrack::Policy.new(course)
  opmc_method = Racetrack::OffPolicyMonteCarloMethod.new(course, value, policy, 0.4)
  (1..10).each do |i|
    10000.times do
      opmc_method.simulate
    end
    puts "--------------------"
    puts "[#{i * 10000}]"
    puts "--------------------"
    course.starts.each do |start|
      puts "----------"
      puts "<start: #{start}>"
      opmc_method.simulate(start, false, true)
      puts "----------"
    end
  end

  # Course 2
  puts "------------------------------"
  puts "コース2"
  puts "------------------------------"
  course = Racetrack::Course.new(COURSE2)
  value = Racetrack::Value.new
  policy = Racetrack::Policy.new(course)
  opmc_method = Racetrack::OffPolicyMonteCarloMethod.new(course, value, policy, 0.4)
  (1..10).each do |i|
    10000.times do
      opmc_method.simulate
    end
    puts "--------------------"
    puts "[#{i * 10000}]"
    puts "--------------------"
    course.starts.each do |start|
      puts "----------"
      puts "<start: #{start}>"
      opmc_method.simulate(start, false, true)
      puts "----------"
    end
  end
end
As in the on-policy case, when not learning the agent follows the estimation policy rather than the behaviour policy.
This version probably needs a bit of explanation about the part that "builds the information needed for the updates, going backwards".
In off-policy Monte Carlo control, when updating the value of each state-action pair, a weight is computed from the probability of observing the sequence of state-action pairs that follows it.
However, that weight is a product of the probabilities of each subsequent state-action pair (given by the behaviour policy) all the way to the end of the episode, so recomputing it from scratch for every pair would be a bit of a pain.
So, exploiting the fact that this weight can be written recursively, the sequence of weights is built by working backwards from the end of the episode.
(Likewise, the return can be expressed recursively, so it is accumulated at the same time.)
Once the update information has been built this way, the values are updated, taking care to update only on the first visit to each state-action pair.
(Note that if the values were updated while building the information backwards, a pair might be updated at a visit that is not its first.)
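In formulas, what the backward loop maintains is roughly the following (a sketch; b denotes the epsilon-soft behaviour policy described above and n the number of valid actions in a state):

# The backward pass starts from the last state-action pair with G = (last reward) and W = 1,
# and then, moving one step towards the start of the episode at a time, maintains
#
#   G_t = r_t + G_{t+1}                      (undiscounted return)
#   W_t = W_{t+1} / b(a_{t+1} | s_{t+1})     (importance-sampling weight)
#
# The target policy is deterministic, so its probability is 1 as long as the taken
# action equals the greedy one and 0 as soon as it differs, which is why the loop
# stops at the first mismatch. For the greedy action,
#   b(a | s) = (1 - epsilon) + epsilon / n
# which is exactly the factor used in the code:
#
#   reward_sum += rewards.pop
#   weight     *= 1.0 / ((1 - @epsilon) + (@epsilon / actions_size))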
Running this gives output like the following.
------------------------------ コース1 ------------------------------ -------------------- [10000] -------------------- ---------- <start: 7,35,0,0> position: ( 7, 35), speed: ( 0, 0), action: ( 1, 0), reward: -1 position: ( 8, 35), speed: ( 1, 0), action: ( 0, -1), reward: -1 position: ( 9, 34), speed: ( 1, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: ( 0, 0), reward: -1 position: ( 9, 30), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: ( 9, 27), speed: ( 0, -3), action: ( 0, 0), reward: -1 position: ( 9, 24), speed: ( 0, -3), action: ( 0, -1), reward: -1 position: ( 9, 20), speed: ( 0, -4), action: ( 1, 1), reward: -1 position: (10, 17), speed: ( 1, -3), action: ( 0, 0), reward: -1 position: (11, 14), speed: ( 1, -3), action: ( 0, 0), reward: -1 position: (12, 11), speed: ( 1, -3), action: ( 1, 1), reward: -1 position: (14, 9), speed: ( 2, -2), action: ( 1, 1), reward: -1 position: (17, 8), speed: ( 3, -1), action: ( 0, 0), reward: -1 total rewards: -13 ---------- ---------- <start: 8,35,0,0> position: ( 8, 35), speed: ( 0, 0), action: ( 1, -1), reward: -1 position: ( 9, 34), speed: ( 1, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: ( 0, 0), reward: -1 position: ( 9, 30), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: ( 9, 27), speed: ( 0, -3), action: ( 0, 0), reward: -1 position: ( 9, 24), speed: ( 0, -3), action: ( 0, -1), reward: -1 position: ( 9, 20), speed: ( 0, -4), action: ( 1, 1), reward: -1 position: (10, 17), speed: ( 1, -3), action: ( 0, 0), reward: -1 position: (11, 14), speed: ( 1, -3), action: ( 0, 0), reward: -1 position: (12, 11), speed: ( 1, -3), action: ( 1, 1), reward: -1 position: (14, 9), speed: ( 2, -2), action: ( 1, 1), reward: -1 position: (17, 8), speed: ( 3, -1), action: ( 0, 0), reward: -1 total rewards: -12 ---------- ---------- <start: 9,35,0,0> position: ( 9, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: ( 9, 34), speed: ( 0, -1), action: ( 0, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: ( 0, 0), reward: -1 position: ( 9, 30), speed: ( 0, -2), action: ( 0, -1), reward: -1 position: ( 9, 27), speed: ( 0, -3), action: ( 0, 0), reward: -1 position: ( 9, 24), speed: ( 0, -3), action: ( 0, -1), reward: -1 position: ( 9, 20), speed: ( 0, -4), action: ( 1, 1), reward: -1 position: (10, 17), speed: ( 1, -3), action: ( 0, 0), reward: -1 position: (11, 14), speed: ( 1, -3), action: ( 0, 0), reward: -1 position: (12, 11), speed: ( 1, -3), action: ( 1, 1), reward: -1 position: (14, 9), speed: ( 2, -2), action: ( 1, 1), reward: -1 position: (17, 8), speed: ( 3, -1), action: ( 0, 0), reward: -1 total rewards: -12 ---------- ---------- <start: 10,35,0,0> position: (10, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: (10, 34), speed: ( 0, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: (-1, -2), action: (-1, 0), reward: -1 position: ( 7, 30), speed: (-2, -2), action: ( 1, 0), reward: -1 position: ( 6, 28), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 6, 25), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 7, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 8, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 13), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (13, 10), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (16, 8), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -11 ---------- ---------- <start: 11,35,0,0> position: (11, 35), speed: ( 0, 0), action: 
(-1, 0), reward: -1 position: (10, 35), speed: (-1, 0), action: ( 1, -1), reward: -1 position: (10, 34), speed: ( 0, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: (-1, -2), action: (-1, 0), reward: -1 position: ( 7, 30), speed: (-2, -2), action: ( 1, 0), reward: -1 position: ( 6, 28), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 6, 25), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 7, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 8, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 13), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (13, 10), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (16, 8), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -12 ---------- ---------- <start: 12,35,0,0> position: (12, 35), speed: ( 0, 0), action: (-1, 0), reward: -1 position: (11, 35), speed: (-1, 0), action: ( 0, -1), reward: -1 position: (10, 34), speed: (-1, -1), action: ( 0, -1), reward: -1 position: ( 9, 32), speed: (-1, -2), action: (-1, 0), reward: -1 position: ( 7, 30), speed: (-2, -2), action: ( 1, 0), reward: -1 position: ( 6, 28), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 6, 25), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 7, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 8, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 13), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (13, 10), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (16, 8), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -12 ---------- 〜省略〜 -------------------- [100000] -------------------- ---------- <start: 7,35,0,0> position: ( 7, 35), speed: ( 0, 0), action: ( 1, 0), reward: -1 position: ( 8, 35), speed: ( 1, 0), action: ( 0, -1), reward: -1 position: ( 9, 34), speed: ( 1, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: (-1, 1), reward: -1 position: ( 8, 31), speed: (-1, -1), action: ( 0, -1), reward: -1 position: ( 7, 29), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 7, 26), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 8, 22), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 18), speed: ( 2, -4), action: (-1, 0), reward: -1 position: (11, 14), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (13, 10), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (16, 7), speed: ( 3, -3), action: ( 1, 1), reward: -1 total rewards: -12 ---------- ---------- <start: 8,35,0,0> position: ( 8, 35), speed: ( 0, 0), action: ( 1, -1), reward: -1 position: ( 9, 34), speed: ( 1, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: (-1, 1), reward: -1 position: ( 8, 31), speed: (-1, -1), action: ( 0, -1), reward: -1 position: ( 7, 29), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 7, 26), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 8, 22), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 18), speed: ( 2, -4), action: (-1, 0), reward: -1 position: (11, 14), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (13, 10), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (16, 7), speed: ( 3, -3), action: ( 1, 1), reward: -1 total rewards: -11 ---------- ---------- <start: 9,35,0,0> position: ( 9, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: ( 9, 34), speed: ( 0, -1), action: ( 0, -1), reward: -1 position: ( 9, 32), speed: ( 0, -2), action: (-1, 1), reward: -1 position: ( 8, 31), speed: (-1, -1), action: ( 
0, -1), reward: -1 position: ( 7, 29), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 7, 26), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 8, 22), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 18), speed: ( 2, -4), action: (-1, 0), reward: -1 position: (11, 14), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (13, 10), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (16, 7), speed: ( 3, -3), action: ( 1, 1), reward: -1 total rewards: -11 ---------- ---------- <start: 10,35,0,0> position: (10, 35), speed: ( 0, 0), action: ( 0, -1), reward: -1 position: (10, 34), speed: ( 0, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: (-1, -2), action: (-1, 0), reward: -1 position: ( 7, 30), speed: (-2, -2), action: ( 1, 0), reward: -1 position: ( 6, 28), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 6, 25), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 7, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 8, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 13), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (13, 10), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (16, 8), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -11 ---------- ---------- <start: 11,35,0,0> position: (11, 35), speed: ( 0, 0), action: (-1, 0), reward: -1 position: (10, 35), speed: (-1, 0), action: ( 1, -1), reward: -1 position: (10, 34), speed: ( 0, -1), action: (-1, -1), reward: -1 position: ( 9, 32), speed: (-1, -2), action: (-1, 0), reward: -1 position: ( 7, 30), speed: (-2, -2), action: ( 1, 0), reward: -1 position: ( 6, 28), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 6, 25), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 7, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 8, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 13), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (13, 10), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (16, 8), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -12 ---------- ---------- <start: 12,35,0,0> position: (12, 35), speed: ( 0, 0), action: (-1, 0), reward: -1 position: (11, 35), speed: (-1, 0), action: ( 0, -1), reward: -1 position: (10, 34), speed: (-1, -1), action: ( 0, -1), reward: -1 position: ( 9, 32), speed: (-1, -2), action: (-1, 0), reward: -1 position: ( 7, 30), speed: (-2, -2), action: ( 1, 0), reward: -1 position: ( 6, 28), speed: (-1, -2), action: ( 1, -1), reward: -1 position: ( 6, 25), speed: ( 0, -3), action: ( 1, -1), reward: -1 position: ( 7, 21), speed: ( 1, -4), action: ( 0, 0), reward: -1 position: ( 8, 17), speed: ( 1, -4), action: ( 1, 0), reward: -1 position: (10, 13), speed: ( 2, -4), action: ( 1, 1), reward: -1 position: (13, 10), speed: ( 3, -3), action: ( 0, 1), reward: -1 position: (16, 8), speed: ( 3, -2), action: ( 1, 1), reward: -1 total rewards: -12 ---------- ------------------------------ コース2 ------------------------------ 〜省略〜
Compared with on-policy Monte Carlo control, you can see that it gets close to the optimal policy earlier on.
I think this is because the values are updated with weights, so effectively more learning happens per episode than the raw number of trials would suggest.
On the other hand, once it reaches a certain point, learning doesn't seem to improve much further.
Conversely, that may be because the effective step size has become large, which makes it harder to converge.
(I haven't verified this properly, so I can't say for sure...)
That's it for today!
- Richard S. Sutton, Andrew G. Barto (translated by 三上貞芳, 皆川雅章), 強化学習, 森北出版, 2000-12-01, paperback.