◾Intro

🔻references

◾Main

🔻Grid World example

🔸Components

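Grid world : a grid of rows and columns (4x3). It is one example of a Markov decision process.
Agent : the robot at (3,1).
- A computer that learns on its own through reinforcement learning (= the subject of learning).
- It must be inside the grid world: it occupies one of the 12 cells of the 4x3 grid and cannot sit on a boundary line (the line between two cells).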
State : a position the agent can occupy.
- In this grid world the cell (2,2) is designated as a position the agent cannot move into, so there are 11 states in total.
- $S = \lbrace (1,1), (1,2),(1,3),(2,1),(2,3),\cdots,(4,3)\rbrace$
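As a quick check on the count of 11, the state set can be enumerated directly. This is a minimal sketch that assumes coordinates are written `(x, y)` with `x` in 1..4 and `y` in 1..3, matching the ordering of $S$ above; the names `WIDTH`, `HEIGHT`, `BLOCKED`, and `STATES` are illustrative and not from the original notes.

```python
from itertools import product

WIDTH, HEIGHT = 4, 3   # 4x3 grid; coordinates are (x, y), both starting at 1 (assumed)
BLOCKED = {(2, 2)}     # the cell the agent cannot move into

# Every cell of the grid except the blocked one is a state.
STATES = [
    (x, y)
    for x, y in product(range(1, WIDTH + 1), range(1, HEIGHT + 1))
    if (x, y) not in BLOCKED
]

print(len(STATES))   # 11
print(STATES[:5])    # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)]
```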
Action : what the agent does.
- The agent can move one cell at a time: north, south, east, or west.
- $A=\lbrace \text{north, south, east, west} \rbrace$
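- Noisy Movement : the agent does not always move as planned. → Randomness is applied to each action.
  
  e.g.) Even if the agent is given the action north, the direction it actually moves follows a distribution such as north (80%), west (10%), east (10%). ~~(If the agent previously moved north, then south, the way it came, is not among the options. For the reason, see small negative rewards below.)~~ If the direction the agent moves in is blocked by a wall (e.g. a wall to the north), the agent "stays" where it is.
- State Transition Probability : the probability of each outcome under noisy movement, i.e. of reaching each next state given the current state and the chosen action.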
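The noisy movement rule can be sketched as a function that returns the distribution over next states. The notes only give the north case (80% north, 10% west, 10% east); extending it to the other actions by slipping to the two perpendicular directions is an assumption, as are treating the blocked cell (2,2) like a wall, taking north as increasing `y`, and the helper names `step` and `transition`.

```python
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}

# Assumed convention: north increases y, east increases x.
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

# Assumed generalization of the north example: slip to the perpendicular directions.
SLIPS = {
    "north": ("west", "east"),
    "south": ("east", "west"),
    "east": ("north", "south"),
    "west": ("south", "north"),
}


def step(state, direction):
    """Move one cell; stay in place if the move hits a wall or the blocked cell."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
    return nxt if inside and nxt not in BLOCKED else state


def transition(state, action):
    """Return {next_state: probability} for taking `action` in `state`."""
    left, right = SLIPS[action]
    probs = {}
    for direction, p in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p  # merge outcomes that land on the same cell
    return probs


print(transition((3, 1), "north"))  # {(3, 2): 0.8, (2, 1): 0.1, (4, 1): 0.1}
```

In a corner such as (1,1), several directions hit a wall, so their probabilities collapse onto the "stays" outcome.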
Reward : the reward given when the agent arrives at a particular state.
- Big Rewards : certain states are assigned a large reward: good (+1), bad (-1).
- Small Negative Rewards : every state other than the big-reward states carries a small negative reward $c$.
  - Why is it needed? Without a small negative reward, the agent can shuttle back and forth (east→west→east→west→ …) at no cost, and the game never ends. With the small negative reward set to -0.1, such repetition keeps accumulating negative reward. Because the agent acts to increase its total reward, assigning a small negative reward everywhere except the big-reward states discourages meaningless repetition.
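The effect of the small negative reward can be seen by summing rewards along two paths. The notes do not say where the +1 and -1 states are, so this sketch assumes the common textbook layout of +1 at (4,3) and -1 at (4,2); `reward` and `total_reward` are illustrative names.

```python
STEP_REWARD = -0.1  # the small negative reward c used in the text

# Assumed positions of the big rewards (not given in the notes).
BIG_REWARDS = {(4, 3): +1.0, (4, 2): -1.0}


def reward(state):
    """Reward received on arriving at `state`."""
    return BIG_REWARDS.get(state, STEP_REWARD)


def total_reward(visited):
    """Sum of rewards over the states the agent arrives at."""
    return sum(reward(s) for s in visited)


# Starting at (3,1): bounce east/west ten times without reaching a goal.
loop = [(4, 1), (3, 1)] * 5
# Starting at (3,1): go north, north, east to the assumed +1 state.
direct = [(3, 2), (3, 3), (4, 3)]

print(round(total_reward(loop), 2))    # -1.0
print(round(total_reward(direct), 2))  # 0.8
```

Every extra step of the loop costs another 0.1, so the looping path falls further behind the direct one the longer it continues.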

🔸Agent's Goal and Policy

Goal : select an action for each state so as to maximize the total sum of rewards.

Policy : the action to take in each state (denoted $\pi$).

e.g.) If the policy for state (1,1) is north, the action taken there is north. In a stochastic grid world, however, the state transition probability adds randomness to the resulting move.
Optimal Policy : the policy that maximizes the total sum of rewards.
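One simple way to picture a policy is as a lookup table from state to action. The sketch below uses a hypothetical, partial policy (only the (1,1) entry comes from the example above) and samples the direction actually executed under the 80/10/10 noise; `executed_direction` and the `SLIPS` table are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical partial policy: pi(state) -> action. Only (1,1) -> north is from the notes.
policy = {(1, 1): "north", (1, 2): "north", (1, 3): "east"}

# Assumed slip directions (perpendicular to the intended action).
SLIPS = {
    "north": ("west", "east"),
    "south": ("east", "west"),
    "east": ("north", "south"),
    "west": ("south", "north"),
}


def executed_direction(action, rng):
    """Sample the direction actually taken: 80% intended, 10% each slip direction."""
    left, right = SLIPS[action]
    return rng.choices([action, left, right], weights=[0.8, 0.1, 0.1])[0]


rng = random.Random(0)    # fixed seed so the run is repeatable
action = policy[(1, 1)]   # the policy always answers "north" for state (1,1)
counts = Counter(executed_direction(action, rng) for _ in range(10_000))

print(action)
print(counts)  # roughly 80% north, 10% west, 10% east
```

The policy itself is deterministic here; the randomness comes entirely from the environment's transition noise.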

🔸Actions in grid world

Deterministic grid world
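- The agent takes exactly the action that the policy prescribes for its current state.
  
  ⇒ Once the policy is fixed, only a single episode can occur for a given start state.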
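The single-episode property can be illustrated by rolling out a fixed policy with no noise. This is a sketch under several assumptions: the terminal states are taken to be (4,3) and (4,2), north increases `y`, the blocked cell acts like a wall, and the policy table is hypothetical. The start state (3,1) is the agent position given in the notes.

```python
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}
TERMINALS = {(4, 3), (4, 2)}  # assumed positions of the big-reward states
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

# Hypothetical policy covering every non-terminal state.
policy = {
    (1, 1): "north", (1, 2): "north", (1, 3): "east",
    (2, 1): "west", (2, 3): "east",
    (3, 1): "north", (3, 2): "north", (3, 3): "east",
    (4, 1): "west",
}


def step(state, action):
    """Deterministic move: the agent goes exactly where the action says, or stays at a wall."""
    x, y = state
    dx, dy = MOVES[action]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
    return nxt if inside and nxt not in BLOCKED else state


def rollout(start, policy, max_steps=50):
    """Follow the policy from `start` until a terminal state (or the step limit)."""
    episode = [start]
    state = start
    while state not in TERMINALS and len(episode) <= max_steps:
        state = step(state, policy[state])
        episode.append(state)
    return episode


print(rollout((3, 1), policy))  # [(3, 1), (3, 2), (3, 3), (4, 3)]
```

Calling `rollout` again with the same start state and policy returns the same list, since nothing in the deterministic world is random.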