SARSA Epsilon-greedy Benchmark
Proposed by Rummery and Niranjan in 1994, SARSA uses the partial trajectory s, a, r, s', a'
to calculate the current Q-estimate x = Q(s, a)
and the target Q-value estimate y = r + gamma * Q(s', a')
(bootstrapping of the Bellman equation). The loss is calculated with L(x, y).
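
As a minimal sketch of the quantities described above, the following Python computes x, y, and a loss for one transition on a toy tabular Q-function. The function names (epsilon_greedy, sarsa_estimates, squared_error), the toy table, and the choice of squared error for L are assumptions made for illustration only; the text does not say which loss is used.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Greedy action: index of the largest Q-value in this state's row.
    return max(range(len(q_row)), key=q_row.__getitem__)


def sarsa_estimates(Q, s, a, r, s_next, a_next, gamma):
    """Return (x, y): x = Q(s, a) and y = r + gamma * Q(s', a')."""
    x = Q[s][a]
    y = r + gamma * Q[s_next][a_next]
    return x, y


def squared_error(x, y):
    # Assumed stand-in for L(x, y); the source leaves L unspecified.
    return (x - y) ** 2


# Toy Q-table (hypothetical values): 2 states, 2 actions.
Q = [[0.0, 1.0],
     [2.0, 0.5]]
rng = random.Random(0)

# SARSA is on-policy: a' is chosen by the same epsilon-greedy policy.
# epsilon=0.0 makes the choice deterministic for this example.
a_next = epsilon_greedy(Q[1], epsilon=0.0, rng=rng)
x, y = sarsa_estimates(Q, s=0, a=1, r=1.0, s_next=1, a_next=a_next, gamma=0.5)
loss = squared_error(x, y)
```

Because a' comes from the behavior policy itself rather than from a max over actions, the target y reflects the exploratory policy actually being followed, which is what distinguishes SARSA from Q-learning.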