We derive a reward prediction error (RPE) using a model-free SARSA learner, a variant of classic reinforcement learning (Sutton and Barto, 1998) (see also Figure 2). The name refers to the experience tuple $\langle s, a, r, s', a' \rangle$, where $s$ and $s'$ are the current and next state, $a$ and $a'$ are the current and next action, and $r$ is the reward obtained. The learner estimates a "state-action value" $Q_{SARSA}(s, a)$ for each state and action. These values are initialized to zero at the start of the experiment. At each step of the task, the value of the state and action actually experienced, $Q_{SARSA}(s, a)$, is updated in light of the reward obtained in the next state, $r(s')$, and the estimated value $Q_{SARSA}(s', a')$ of the next state and action. In particular, an RPE $\delta_{RPE}$ is computed as $\delta_{RPE} = r(s') + \gamma Q_{SARSA}(s', a') - Q_{SARSA}(s, a)$, where $\gamma$ is the temporal discount factor, which we fixed to $\gamma = 1$ because the 2-step task does not allow subjects to choose between rewards at different delays.
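The RPE computation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the state and action labels, the terminal state `"end"`, and the learning rate `ALPHA` are hypothetical, and the value update `Q <- Q + ALPHA * delta` is a standard SARSA rule assumed here for illustration, since the passage above specifies only the RPE itself.

```python
from collections import defaultdict

GAMMA = 1.0  # temporal discount factor, fixed to 1 as stated in the text
ALPHA = 0.5  # learning rate: an assumption, not given in the text


def sarsa_rpe(q, s, a, r_next, s_next, a_next, gamma=GAMMA):
    """Return the RPE: r(s') + gamma * Q(s', a') - Q(s, a)."""
    return r_next + gamma * q[(s_next, a_next)] - q[(s, a)]


# State-action values start at zero for every (state, action) pair.
q = defaultdict(float)

# First-stage choice on trial 1 (hypothetical labels): no reward is
# delivered on arriving at the second stage, and all values are still zero.
delta_1 = sarsa_rpe(q, "stage1", "left", 0.0, "stage2A", "right")
q[("stage1", "left")] += ALPHA * delta_1  # assumed update rule

# Second-stage choice on trial 1: a reward of 1 is obtained, and the
# hypothetical terminal state "end" carries a value of zero.
delta_2 = sarsa_rpe(q, "stage2A", "right", 1.0, "end", None)
q[("stage2A", "right")] += ALPHA * delta_2  # assumed update rule

# First-stage choice on trial 2: the learned second-stage value now
# produces a positive RPE even though no reward is delivered yet.
delta_3 = sarsa_rpe(q, "stage1", "left", 0.0, "stage2A", "right")

print(delta_1, delta_2, delta_3)
```

With $\gamma = 1$, the value of the second-stage state-action pair enters the first-stage RPE undiscounted, which is why the third RPE in this sketch equals the second-stage value learned on the previous trial.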