Title: Exploiting Exploration: Past Outcomes and Future Actions
Abstract: Applying past knowledge to future actions is crucial for adaptive choice behavior. Here, in this issue of Neuron, Donahue et al. (2013) show that reward enhances neural coding reliability for actions in a network of frontal and parietal brain areas.

Consider a soccer player lining up for a penalty kick, who knows from past experience that the goalie has a slight bias for rightward saves, but only at the end of a match. To use that information, he must weigh the context, appropriately value the different alternatives, and select and execute an action. Thus, the process of applying prior knowledge to future behavior involves a number of related cognitive functions—including valuation, memory, action selection, and cognitive control—necessary for adaptive decision making in a dynamic environment. An influential framework integrating these processes arises from computational theories of machine learning (Sutton and Barto, 1998). The core idea in these reinforcement learning (RL) models is that agents acquire information about the value of actions through interaction with the environment, using reward to guide the learning process. To update the value of actions, such models employ error-driven learning using a quantity known as the reward prediction error (RPE), the difference between the reward received and the reward expected.
For example, actions that produce reward that is better than expected have their associated values increased. In stable environments, this procedure produces value estimates that converge appropriately to the average reward. Neuroscientific interest in RL emerged with the discovery that midbrain dopamine neuron activity in classical and operant conditioning tasks carries an RPE signal (Schultz et al., 1997). Subsequent work has successfully applied the RL framework across a variety of brain systems and behavioral paradigms to characterize value- and choice-related neural activity and value-guided decision behavior (Lee et al., 2012).

Reinforcement learning works remarkably well in stable environments but faces an additional important challenge when the state of the world is uncertain and changing (Sutton and Barto, 1998). Given an evaluation of the possible actions, the goal of an agent is to exploit current knowledge by choosing the highest-valued option. However, changing conditions in a dynamic environment necessitate exploration of nonoptimal alternatives in order to keep value estimates accurately updated. Solving this tradeoff between exploration and exploitation is a fundamental problem in learning through reinforcement. A particular question of interest is how the brain switches from exploitative behavior, which is a natural byproduct of a value-guided decision system, to strategic exploratory behavior, which forgoes current value maximization in favor of more global optimality. Recent progress has identified neural substrates involved in exploration, such as the neuromodulatory noradrenergic system (Usher et al., 1999) and frontopolar cortex (Daw et al., 2006), but the full extent of the cortical circuits involved in strategic exploration is unknown.

In this issue of Neuron, Donahue et al. (2013) examine the relationship between strategic exploration and a network of cortical regions related to saccade selection, execution, and postsaccade processing. Action selection in the eye movement system has long been a model of neurobiological decision making (Glimcher, 2003), and lesion and electrophysiology studies have identified core sensorimotor structures involved in the decision process (Figure 1A). A key structure in this network is the lateral intraparietal (LIP) area, which receives afferents from higher-order sensory areas and displays both sensory and motor modulation. Consistent with a central role in decision making, saccade-selective activity in LIP represents the information necessary for decision formation, for example, accumulating evidence for a given response in perceptual discrimination tasks (Shadlen and Newsome, 2001). LIP efferents project to the frontal eye field (FEF) and superior colliculus (SC), and together this network (along with the caudate nucleus in the basal ganglia) plays an essential role in saccade selection and execution.
FEF and SC are necessary for saccade generation: saccades are initiated only when movement-related activity reaches a fixed threshold, microstimulation in these structures elicits fixed-vector saccades, and lesions disrupt saccade initiation. Consistent with the anatomy, LIP appears to play a more upstream role: while lesions of LIP leave saccades to single targets relatively intact, they produce substantial deficits in target selection among multiple alternatives. Importantly, action value information strongly modulates neural activity in these areas during the choice process, consistent with an integrated evaluation and decision-making network (Glimcher, 2003). Note that this system, which selects saccades based on value, is designed to implement exploitative behavior. However, a number of additional brain areas anatomically and functionally linked to these core sensorimotor circuits play a different, less transparent role in choice behavior. These include three interconnected regions (among others) in frontal cortex: the supplementary eye field (SEF), anterior cingulate cortex (ACC), and dorsolateral prefrontal cortex (DLPFC). These areas are related to saccades but are not necessary for action initiation or execution; instead, activity in these regions represents a variety of error- and reward-related signals that may be involved in performance monitoring and executive control of gaze (Ito et al., 2003, Schall et al., 2002, Stuphorn et al., 2010).
Neural activity in these areas, most notably in SEF and ACC, often occurs after action completion, consistent with a role in reward processing and outcome-based updating. Similar evaluative signals occur in humans, where strong negative potentials are recorded over medial frontal cortex when errors are made in simple behavioral tasks (Gehring et al., 1993). In contrast to the core oculomotor network defined by LIP and FEF, postaction processing in these frontal areas suggests a role in executive control and a potential involvement in regulating exploratory behavior.

In this study, Donahue et al. (2013) examine the neural basis of strategic exploration by taking advantage of an impressive data set of recordings from SEF, DLPFC, ACC, and LIP neurons and using two behavioral tasks designed to elicit either exploitation or exploration. To elicit exploitation, they used a simple visual search task in which the location of the rewarded target was explicitly cued on each trial; reward was determined by a fixed rule, and the monkeys simply had to choose the high-value, cued target. To elicit exploration, they used a competitive game known as matching pennies. In this task, played against a computer opponent, the monkeys chose between two identical targets. Much like the soccer player taking a penalty kick, reward outcome depended on the behavior of the opponent: the monkey was rewarded only if he chose the same target chosen by the computer (revealed after the animal's choice). Importantly, the computer opponent employed an algorithm that took advantage of any statistical biases evident in the animal's behavior.
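A toy simulation makes the logic of this design concrete (an illustrative sketch only; the actual opponent algorithm in the study was more sophisticated than the frequency tracker assumed here). An opponent that predicts the player's likely choice and selects the other target penalizes any bias:

```python
# Toy matching-pennies game: the player is rewarded only on a match,
# and the opponent exploits the player's choice-frequency bias.
# (Illustrative sketch; not the opponent algorithm used in the study.)
import random

def play(choose, n_trials=10000, seed=0):
    """Return the player's reward rate against a bias-exploiting opponent."""
    rng = random.Random(seed)
    counts = [1, 1]                      # smoothed counts of the player's past choices
    rewards = 0
    for _ in range(n_trials):
        predicted = 0 if counts[0] >= counts[1] else 1
        computer = 1 - predicted         # avoid the player's likely choice
        player = choose(rng)
        counts[player] += 1
        rewards += (player == computer)  # reward only on a match
    return rewards / n_trials

biased = play(lambda rng: 0 if rng.random() < 0.7 else 1)  # prefers target 0
unbiased = play(lambda rng: rng.randrange(2))              # truly random
print(biased, unbiased)  # the biased player earns less
```

Only a player whose choices are unpredictable trial to trial secures the full 50% reward rate, which is why the game pushes behavior toward stochastic, exploratory choice.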
Thus, to achieve the optimal reinforcement rate, the monkey should on average choose each target equally often and with independent probability, irrespective of past choices and outcomes. Using this form of competitive game has two experimental advantages. First, by penalizing behavioral biases, the task encourages strategic exploration rather than deterministic behavior. Second, the resulting stochastic behavior dissociates past actions and rewards from future choices, enabling the experimenters to determine whether neural activity reflects the influence of previous knowledge or current action planning.

Donahue et al. (2013) find that previous rewards and actions influence activity during the matching pennies task in all four cortical regions examined, but with some notable and important differences between areas. In a given trial, during the time before the monkey made a choice, a significant fraction of neurons in all four areas signaled the choice and reward outcome of the previous trial. Notably, neurons in SEF, DLPFC, and LIP—but not ACC—also coded the interaction between previous reward and choice (Figure 1B). This interaction reflects a gating of action coding by reward, such that neural discrimination between past actions is enhanced if reward was received (inset). Because past and future choices were dissociated in this task, Donahue et al. (2013) further show that this enhanced discriminability reflects information about the past but not the upcoming choice. Intriguingly, Donahue et al. (2013) find that the SEF may play a particularly important role in governing exploratory behavior.
While performance in the matching pennies task approached optimal randomness, the monkeys showed a slight but significant bias in their behavior. Specifically, they adopted an asymmetric win-stay lose-switch strategy, repeating previous choices if rewarded and switching targets if unrewarded on the previous trial. Their slight tendency to win-stay more often than lose-switch produced a small and fluctuating bias to choose the same target on successive trials. Notably, Donahue et al. (2013) found that switching behavior was significantly correlated with the reward-driven improvements in neural action decoding, but only in SEF. Furthermore, the enhanced discriminability in SEF was largely attenuated during the visual search task, suggesting that this area may play a unique role in guiding exploratory behavior.

These findings raise important questions about how these different cortical regions interact in reinforcement-guided behavior. Enhanced SEF action coding following reward appears to facilitate subsequent exploratory switching behavior, but exactly how it does so is not known. One likely candidate is the strong projection from SEF to FEF. Microstimulation of SEF neurons can produce either excitatory or suppressive effects on FEF-mediated saccade initiation, consistent with a contextual form of executive control sensitive to task demands (Stuphorn and Schall, 2006). Thus, SEF may drive exploratory behavior by proactively influencing the saccade selection process in FEF, perhaps by overriding the default exploitative behavior driven by reinforcement learning. Characterizing the nature of this interaction will be an important focus of future research.
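The win-stay/lose-switch asymmetry described above can be quantified directly from a sequence of choices and outcomes. A minimal sketch (the function name and the short data sequence are illustrative, not from the paper's analysis):

```python
# Sketch of estimating win-stay and lose-switch probabilities from a
# sequence of (choice, rewarded) trials. Data below are illustrative.
def wsls_probabilities(trials):
    """trials: list of (choice, rewarded) tuples. Returns
    (P(stay | previous win), P(switch | previous loss))."""
    win_stay = win_n = lose_switch = lose_n = 0
    for (prev_c, prev_r), (cur_c, _) in zip(trials, trials[1:]):
        if prev_r:
            win_n += 1
            win_stay += (cur_c == prev_c)
        else:
            lose_n += 1
            lose_switch += (cur_c != prev_c)
    return win_stay / win_n, lose_switch / lose_n

# A short illustrative sequence whose win-stay tendency exceeds its
# lose-switch tendency, as in the monkeys' behavior.
trials = [(0, True), (0, True), (0, False), (0, True),
          (1, False), (0, False), (1, True), (1, True)]
p_ws, p_ls = wsls_probabilities(trials)
print(p_ws, p_ls)  # win-stay probability is higher than lose-switch
```

When the two conditional probabilities are unequal, successive choices are no longer independent, which is exactly the kind of exploitable regularity the computer opponent was designed to detect.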
Ultimately, these results point toward a more nuanced view of reinforcement learning in the brain. Traditional RL algorithms, including many of those used to study decision-related neural activity, focus on learning the values of actions and choosing according to previously received reward. In contrast to such model-free RL, increasing work has focused on model-based learning strategies, which maintain an internal model of the world and attempt to learn the sequential contingencies of events, actions, and rewards (Doll et al., 2012). In the complicated dynamics of a competitive game, reward is determined not by the choice of a particular action but by a sequence of actions. Thus, a monkey playing matching pennies must learn strategies rather than specific actions. This complexity may explain why circuits like ACC and DLPFC, which display significant choice- and reward-related activity related to value-guided behavior, apparently contribute little to strategic exploration. Much work remains to be done in characterizing the interconnected brain regions responsible for exploitation, exploration, and their relative balance. These findings provide an important roadmap for future research at the intersection of reinforcement learning and strategic behavior.