Subjects learned the task in 3 weeks with minimal shaping and performed an average of 576 ± 174 (mean ± SD) trials per day thereafter (Table 1). Our behavioral dataset used data from day 22 of training onward (n = 17 mice, 400 sessions, 230,237 trials). Subjects tracked which first-step action had higher reward probability (Figures 1D and 1E), choosing the correct option at the end of non-neutral blocks with probability 0.68 ± 0.03 (mean ± SD). Choice probabilities adapted faster following reversals in the action-state transition probabilities (exponential fit tau = 17.6 trials), compared with reversals in the reward probabilities (tau = 22.7 trials, p = 0.009, bootstrap test; Figure 1E).Table 1Two-Step Task Parameter Changes over TrainingSession NumberReward Size (μl)Transition Probabilities (Common/Rare)Reward Probabilities (Good/Bad Side)1100.9/0.1first 40 trials all rewarded, subsequently 0.9/0.12–4100.9/0.10.9/0.15–66.50.9/0.10.9/0.17–840.9/0.10.9/0.19–1240.8/0.20.9/0.1≥1340.8/0.20.8/0.2