We used a Dynamic Programming approach to implement a FORWARD learner, which utilizes experience with state transitions to update an estimated state transition matrix T(s, a, s’) of transition probabilities. Each element of T(s, a, s’) therefore holds the current estimate of the probability of transitioning from state s to s’ given action a. These transitions are initialized to uniform distributions connecting each state and action to those on the next level of the tree. Upon each step, leaving state s and arriving in state s’ having taken action a, the FORWARD learner computes a state prediction error (SPE): δSPE=1−T(s,a,s′) and updates the probability T(s,a,s’) of the observed transition via: T(s,a,s′)=T(s,a,s′)+ηδSPE where η is a free parameter controlling the FORWARD learning rate. The estimated probabilities for all states not arrived in (i.e. for all states s” other than s’) are reduced according to T(s,a,s”) = T(s,a,s”) · (1–η), to ensure that the distribution remains normalized.