Chunk #39 — EXPERIMENTAL PROCEDURES — FORWARD Learner

Source: States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning.
Embedded: yes

Text

Estimated transition probabilities $T(s,a,s')$ are used together with the rewards at the end states, $r(s)$ (taken as given, since participants were instructed in them), to compute state-action values $Q_{FWD}$ as the expectation over the value of the successor state. This is done by dynamic programming, i.e., by recursively evaluating the Bellman equation, which defines the state-action values at each level in terms of those at the next level. Here, $Q_{FWD}(s,a) = 0$ for the terminal reward states at the bottom of the tree, and for the other states:

$$Q_{FWD}(s,a) = \sum_{s'} T(s,a,s') \times \left( r(s') + \max_{a'} Q_{FWD}(s',a') \right)$$
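
The recursion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `forward_q_values`, the array layout, the seven-state tree, and all transition probabilities and reward values are hypothetical, chosen only to make the backward pass concrete. The sketch assumes that states are indexed so that every successor has a higher index than its predecessor, and it takes the value of a successor state to be the maximum of its state-action values.

```python
import numpy as np

def forward_q_values(T, r, terminal):
    """Compute state-action values by backward recursion over a tree.

    T        : array (n_states, n_actions, n_states), T[s, a, s2] is the
               estimated probability of reaching s2 from s under action a.
    r        : array (n_states,), reward received on entering each state.
    terminal : boolean array (n_states,), True for end states.

    Assumption (not from the source): successors always have a higher
    index than their predecessor, so one reverse sweep is sufficient.
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))  # Q stays 0 for terminal states
    V = np.zeros(n_states)               # V[s] = max over a of Q[s, a]
    for s in reversed(range(n_states)):
        if terminal[s]:
            continue
        # Expectation over successors of (reward + best successor value)
        Q[s] = T[s] @ (r + V)
        V[s] = Q[s].max()
    return Q

# Hypothetical two-level tree: state 0 is the root, states 1 and 2 are
# intermediate, states 3 to 6 are terminal reward states.
T = np.zeros((7, 2, 7))
T[0, 0, [1, 2]] = [0.7, 0.3]
T[0, 1, [1, 2]] = [0.3, 0.7]
T[1, 0, [3, 4]] = [0.7, 0.3]
T[1, 1, [3, 4]] = [0.3, 0.7]
T[2, 0, [5, 6]] = [0.7, 0.3]
T[2, 1, [5, 6]] = [0.3, 0.7]

r = np.array([0.0, 0.0, 0.0, 0.0, 10.0, 25.0, 0.0])  # illustrative rewards
terminal = np.array([False, False, False, True, True, True, True])

Q = forward_q_values(T, r, terminal)
print(Q[:3])  # state-action values for the three non-terminal states
```

Because the tree has no cycles, a single sweep from the bottom level to the root evaluates every state-action value exactly once, so no iteration to convergence is needed in this setting.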