◾Intro

🔻references

Monte Carlo = MC
Monte Carlo란? 반복적인 랜덤 샘플링(repeated random sampling)을 사용하여 특정한 수치 결과(numerical result)를 얻어내는 계산 방법(computations).

computations relied on repeated random sampling to obtain numerical results.
강화학습에는 크게 MC와 TD(temporal difference) 방식이 존재한다.

모든 RL이 tabular updating method를 사용하듯 MC도 마찬가지이다.
- RL = Rainforcement Learning
- tabular = action-value function에 대한 table을 의미 = 간단히 Q-table이라 하기도 한다.
model-free : state transition probability에 대해 모르는 상태이다.
MC Policy Iteration은 GPI(Generalized Policy Iteration)을 기반으로 한다. 다른 RL과의 차이점은 episode-by-episode로 샘플을 업데이트한다는 점이다. (MC는 한 episode가 끝나면 그것을 기반으로 PE, PI를 진행하기 때문에 episode-by-episode인 것.)

Episode-by-episode GPI means that PE and PI are alternately repeated every episode.
- episode task란? 유한한 시간 안에 종료되는 task
- continuous task? 끝이 없는, 계속 진행되는 task
- PE = Policy Evaluation
  - MC의 PE는 $\mathcal{Q}(s,a)=q_\pi(s,a)$를 평가
- PI = Policy Improvement
  - MC의 PI는 $\epsilon\text{-greedy PI}$를 사용
sample multi-step backup
- 하나의 step이 terminal state까지 이동해야 episode 하나가 끝나는 구조.

⇒ MC learns from the entire trajectories(궤적) of sampled episodes of real experiences by updating after every single episode.