The estimated transition probabilities are combined with the rewards at the end states, $r(s)$ (taken as given, since the participants were instructed in them), to compute state-action values $Q_{FWD}$ as the expectation over the value of the successor state. This is done by dynamic programming, i.e., by recursively evaluating the Bellman equation, which defines the state-action values at each level of the tree in terms of those at the next level. Here, $Q_{FWD}(s,a) = 0$ for the terminal reward states at the bottom of the tree, and for all other states:

$$Q_{FWD}(s,a) = \sum_{s'} T(s,a,s') \times \left( r(s') + \max_{a'} Q_{FWD}(s',a') \right)$$
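As an illustration of the recursion, the following minimal Python sketch evaluates the Bellman equation on a small two-level tree. The function name `forward_q_values`, the array layout, and all transition probabilities and rewards are hypothetical values chosen for the example, not quantities from the study. The sketch uses repeated sweeps over all states instead of an explicit level-by-level pass; on a tree the sweeps stop changing after as many iterations as the tree is deep.

```python
import numpy as np


def forward_q_values(T, r, terminal):
    """Evaluate Q_FWD by repeated Bellman backups (illustrative sketch).

    T        : array of shape (S, A, S), T[s, a, s2] = P(s2 | s, a)
    r        : array of shape (S,), reward received on entering each state
    terminal : boolean array of shape (S,), True for end states
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    # A tree with S states has depth below S, so S sweeps always suffice.
    for _ in range(n_states):
        V = Q.max(axis=1)      # max over a' of Q(s', a')
        V[terminal] = 0.0      # Q is defined as 0 at terminal states
        Q_new = T @ (r + V)    # sum over s' of T(s,a,s') * (r(s') + V(s'))
        Q_new[terminal] = 0.0
        if np.allclose(Q_new, Q):
            break
        Q = Q_new
    return Q


# Hypothetical tree: state 0 is the root, states 1-2 are intermediate,
# states 3-6 are terminal reward states.
S, A = 7, 2
T = np.zeros((S, A, S))
T[0, 0, [1, 2]] = [0.7, 0.3]
T[0, 1, [1, 2]] = [0.3, 0.7]
T[1, 0, [3, 4]] = [0.7, 0.3]
T[1, 1, [3, 4]] = [0.3, 0.7]
T[2, 0, [5, 6]] = [0.7, 0.3]
T[2, 1, [5, 6]] = [0.3, 0.7]

r = np.array([0.0, 0.0, 0.0, 0.0, 10.0, 25.0, 0.0])  # made-up rewards
terminal = np.array([False, False, False, True, True, True, True])

Q = forward_q_values(T, r, terminal)
print(Q[:3])  # state-action values for the three non-terminal states
```

In this example the values at the intermediate states are computed first from the terminal rewards, and the root values then follow from the best action available at each intermediate state, which is the role of the max in the equation.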