Active Inference: The Free Energy Principle in Mind, Brain, and Behavior
January 18th, 2025

Preface

Karl Friston

Active Inference is a way of understanding sentient behavior. The very fact that you are reading these lines means that you are engaging in Active Inference: actively sampling the world in a particular way, because you believe you will learn something. You are palpating this page with your eyes simply because this is the kind of action that will resolve uncertainty about what you will see next and, indeed, what these words convey. In short, Active Inference puts the action into perception, whereby perception is treated as perceptual inference or hypothesis testing. Active Inference goes even further and considers planning as inference: inferring what you would do next to resolve uncertainty about your lived world.

To illustrate the simplicity of Active Inference, and what we are trying to explain, place your fingertips gently on your leg. Keep them there motionless for a second or two. Now, does your leg feel rough or smooth? If you had to move your fingers to evince a feeling of roughness or smoothness, you have discovered a fundament of Active Inference. To feel is to palpate. To see is to look. To hear is to listen. This palpation does not necessarily have to be overt; we can act covertly by directing our attention to this or that. In short, we are not simply trying to make sense of our sensations; we have to actively create our sensorium. In what follows, we will see why this has to be the case and why everything that we perceive, do, or plan is in the compass of one existential imperative: self-evidencing.

Active Inference is not just about reading or epistemic foraging. It is, on one view, something that all creatures and particles do, in virtue of their existence. This might sound like a strong claim; however, it speaks to the fact that Active Inference inherits from a free energy principle that equates existence with self-evidencing and self-evidencing with an enactive sort of inference. However, this book is not concerned with the physics of sentient systems. Its focus is on the implications of this physics for understanding how the brain works.

This understanding is not an easy business, as witnessed by millennia of natural philosophy and centuries of neuroscience. Although one can find the roots of Active Inference in first-principle accounts of self-organized behavior (i.e., variational principles akin to Hamilton's principle of stationary action), first principles do not help very much when asking how a particular brain works and how it differs from another brain. For example, committing to the theory of evolution by natural selection does not help in the slightest when it comes to understanding why I have two eyes or speak French. This book is about using principles to scaffold key questions in neuroscience and artificial intelligence. To do this, we have to move beyond principles and get to grips with the mechanics to which the principles apply.

As such, Active Inference, and its accompanying Bayesian mechanics, is there to frame questions about how we perceive, plan, and act. Crucially, it does not aim to replace other frameworks, such as behavioral psychology, decision theory, and reinforcement learning. Rather, it hopes to embrace all those approaches that have proven so successful within a unified framework. In what follows, we will pay special attention to linking key constructs from psychology, cognitive neuroscience, enactivism, ethology, and so on to the calculus of belief updating in Active Inference and its associated process theories.

By process theories, we refer to theories about how belief updating is realized by neuronal (and other biophysical) processes in the embodied brain and beyond. Work to date in Active Inference offers a fairly straightforward set of computational architectures and simulation tools to both model various aspects of a functioning brain and enable people to test hypotheses about different computational architectures. However, these tools only solve half the problem. At the heart of Active Inference lies a generative model: namely, a probabilistic representation of how unobservable causes in the world out there generate the observable consequences, our sensations. Getting the generative model right, as an apt explanation for the sentient behavior of any experimental subject or creature, is the big challenge.

This book tries to explain how to meet this challenge. The first part sets up the basic ideas and formalisms that are called on in the second part, to illustrate how they can be applied in practice. In short, this book is for people who want to use Active Inference to simulate and model sentient behavior, in the service of either scientific inquiry or, possibly, artificial intelligence. Thus it focuses on those ideas and procedures that are necessary to understand and implement an Active Inference scheme without getting distracted by the physics of sentient systems on the one hand or philosophy on the other.

A Note from Karl Friston

I have a confession to make. I did not write much of this book. Or, more precisely, I was not allowed to. This book's agenda calls for a crisp and clear writing style that is beyond me. Although I was allowed to slip in a few of my favorite words, what follows is a testament to Thomas and Giovanni, their deep understanding of the issues at hand, and, importantly, their theory of mind-in all senses.

Acknowledgments

We gratefully acknowledge invaluable input from our friends and colleagues, in particular, past and present members of the Theoretical Neurobiology group at the Wellcome Centre for Human Neuroimaging, University College London; the Cognition in Action (CONAN) Lab at the Institute of Cognitive Sciences and Technologies, National Research Council of Italy; and numerous international collaborators who have been integral to the development of the ideas presented in this book. This young but growing community has been more than generous in providing both intellectual support and motivation. Furthermore, we gratefully acknowledge Robert Prior and Anne-Marie Bono from MIT Press for kindly accompanying and advising us during the preparation of this book and Jakob Hohwy and other thoughtful reviewers for their guidance. Finally, we thank the funding agencies that provided financial support for our research: KJF was funded by a Wellcome Trust Principal Research Fellowship (Ref: 088130/Z/09/Z); GP was funded by the European Research Council under Grant Agreement No. 820213 (ThinkAhead) and the European Union's Horizon 2020 Framework Programme for Research and Innovation under Specific Grant Agreement No. 945539 (Human Brain Project SGA3).

1 Overview

Chance favors the prepared mind.

-Louis Pasteur

1.1 Introduction

This chapter introduces the main question that Active Inference seeks to address: How do living organisms persist while engaging in adaptive exchanges with their environment? We discuss the motivation for addressing this question from a normative perspective, which starts from first principles and then unpacks their cognitive and biological implications. Furthermore, this chapter briefly introduces the structure of the book, including its subdivision into two parts: the first of which aims to help readers understand Active Inference, and the second of which aims to help them use it in their own research.

1.2 How Do Living Organisms Persist and Act Adaptively?

Living organisms constantly engage in reciprocal interactions with their environment (including other organisms). They emit actions that change the environment and receive sensory observations from it, as schematically illustrated in figure 1.1.

Living organisms can only maintain their bodily integrity by exerting adaptive control over the action-perception loop. This means acting to solicit sensory observations that either correspond to desired outcomes or goals (e.g., the sensations that accompany secure nutrients and shelter for simple organisms, or friends and jobs for more complex ones) or help in making sense of the world (e.g., informing the organism about its surroundings).

Figure 1.1 An action-perception cycle reciprocally connecting a creature and its environment. The term environment is intentionally generic. In the examples that we discuss, it can include the physical world, the body, the social environment, and so on.

Engaging in adaptive action-perception loops with the environment poses formidable challenges to living organisms. This is largely due to the recursive nature of the cycle, where each observation, solicited by the previous action, changes how we decide on the next action, to solicit the next observation. The possibilities for control and adaptation are plentiful, but very few are useful. Yet during evolution, living organisms have managed to develop adaptive strategies to face the fundamental challenges of existence. These strategies vary in their level of cognitive sophistication, with simpler and more rigid solutions in simpler organisms (e.g., following nutrient gradients in bacteria) and more cognitively demanding and flexible solutions in more advanced organisms (e.g., planning to achieve distal goals in humans). These strategies also vary in the timescales at which they are selected and operate, ranging from simple responses to environmental threats or morphological adaptations that arise at an evolutionary timescale, to behavioral patterns established during cultural or developmental learning, up to those requiring cognitive processes that operate at timescales comparable to action and perception (e.g., attention and memory).

1.3 Active Inference: Behavior from First Principles

This diversity is a blessing for biology but challenging for formal theories of brain and mind. Broadly, there are two perspectives we could take on this. One perspective is that different biological adaptations, neural processes (e.g., synaptic exchanges and brain networks), and cognitive mechanisms (e.g., perception, attention, social interaction) are highly idiosyncratic and require dedicated explanations. This would lead to a proliferation of theories in fields like philosophy, psychology, neuroscience, ethology, biology, artificial intelligence, and robotics, with little hope for their unification. Another perspective is that, despite their diverse manifestations, the central aspects of behavior, cognition, and adaptation in living organisms are amenable to a coherent explanation from first principles.

These two possibilities map to two different research programs and, to some extent, different attitudes toward science: "neats" versus "scruffies" (terms due to Roger Schank). Neats always seek unification beyond the (apparent) heterogeneity of brain and mind phenomena. This usually corresponds to designing top-down, normative models that start from first principles and try to derive as much as possible about brains and minds. Scruffies instead embrace the heterogeneity by focusing on details that demand dedicated explanations. This usually corresponds to designing bottom-up models that start from data and use whatever works to explain complex phenomena, including different explanations for different phenomena.

Is it possible to explain heterogeneous biological and cognitive phenomena from first principles, as the neats assume? Is a unified framework to understand brain and mind possible?

This book answers these questions affirmatively and advances Active Inference as a normative approach to understand brain and mind. Our treatment of Active Inference starts from first principles and unpacks their cognitive and biological implications.

1.4 Structure of the Book

The book comprises two parts. These are aimed at readers who want to understand Active Inference (first part) and those who seek to use it for their own research (second part). The first part of the book introduces Active Inference both conceptually and formally, contextualizing it within current theories of cognition. The goal of this first part is to provide a comprehensive, formal, and self-contained introduction to Active Inference: its main constructs and implications for the study of brain and cognition.

The second part of the book illustrates specific examples of computational models that use Active Inference to explain cognitive phenomena, such as perception, attention, memory, and planning. The goal of this second part is to help readers both understand existing computational models using Active Inference and design novel ones. In short, this book divides into theory (part 1) and practice (part 2).

1.4.1 Part 1: Active Inference in Theory

Active Inference is a normative framework to characterize Bayes-optimal behavior and cognition in living organisms. Its normative character is evinced in the idea that all facets of behavior and cognition in living organisms follow a unique imperative: minimizing the surprise of their sensory observations. Surprise has to be interpreted in a technical sense: it measures how much an agent's current sensory observations differ from its preferred sensory observations, that is, those that preserve its integrity (e.g., for a fish, being in the water). Importantly, minimizing surprise is not something that can be done by passively observing the environment: rather, agents must adaptively control their action-perception loops to solicit desired sensory observations. This is the active bit of Active Inference.

Minimizing surprise turns out to be a challenging problem for technical reasons that will become apparent later. Active Inference offers a solution to this problem. It assumes that even if living organisms cannot directly minimize their surprise, they can minimize a proxy, called (variational) free energy. This quantity can be minimized through neural computation in response to (and in anticipation of) sensory observations. This emphasis on free energy minimization discloses the relation between Active Inference and the (first) principle that motivates it: the free energy principle (Friston 2009).

Free energy minimization seems a very abstract starting point to explain biological phenomena. However, it is possible to derive a number of formal and empirical implications from it and to address a number of central questions in cognitive and neural theory. These include how the variables involved in free energy minimization may be encoded in neuronal populations; how the computations that minimize free energy map to specific cognitive processes, such as perception, action selection, and learning; and what kind of behaviors emerge when an Active Inference agent minimizes its free energy.

As the above list of topics exemplifies, in this book we are mainly concerned with Active Inference and free energy minimization at the level of living organisms, whether simpler (e.g., bacterial) or more complex (e.g., human), and their behavioral, cognitive, social, and neural processes. This clarification is necessary to contextualize our treatment of Active Inference within the more general free energy principle (FEP), which addresses free energy minimization across a much wider range of biological phenomena and timescales beyond neural information processing, ranging from the evolutionary to the cellular and cultural (Friston, Levin et al. 2015; Isomura and Friston 2018; Palacios, Razi et al. 2020; Veissière et al. 2020); these are beyond the scope of this book.

It is possible to motivate Active Inference by taking one of two roads: a high road and a low road; see figure 1.2. These two roads provide two distinct but highly complementary perspectives on Active Inference:

• The high road to Active Inference starts from the question of how living organisms persist and act adaptively in the world and motivates Active Inference as a normative solution to these problems. This high road perspective is useful to understand the normative nature of Active Inference: what living organisms must do to face their fundamental existential challenges (minimize their free energy) and why (to vicariously minimize the surprise of their sensory observations).

• The low road to Active Inference starts from the notion of the Bayesian brain, which casts the brain as an inference engine trying to optimize probabilistic representations of the causes of its sensory input. It then motivates Active Inference as a specific, variational approximation to the (otherwise intractable) inferential problem, which has a degree of biological plausibility. This low road perspective is useful to illustrate how Active Inference agents minimize their free energy, therefore illustrating Active Inference not just as a principle but also as a mechanistic explanation (aka process theory) of cognitive functions and their neuronal underpinnings.

In chapter 2, we set out the low road perspective on Active Inference. We start from foundational theories that cast perception as a problem of statistical (Bayesian) inference (Helmholtz 1866) and their modern incarnation in the Bayesian brain hypothesis (Doya 2007). We will see that to perform such (perceptual) inference, living organisms must be equipped with, or embody, a probabilistic generative model of how their sensory observations are generated, which encodes beliefs (probability distributions) about both observable variables (sensory observations) and nonobservable (hidden) variables. We will extend this inferential view beyond perception to cover problems of action selection, planning, and learning.

In chapter 3, we will illustrate the complementary high road perspective on Active Inference. This chapter introduces the FEP and the imperative for biological organisms to minimize surprise. Further to this, it unpacks how this principle encompasses the dynamics of self-organization and the preservation of a statistical boundary, or Markov blanket, that maintains separation from the environment. This is vital in maintaining the integrity of biological creatures, and it is central to their autopoiesis.

In chapter 4, we will unpack Active Inference more formally. This chapter takes its cue from the discussion of the Bayesian brain in chapter 2 and sets out the mathematical relationship between the self-evidencing dynamics of chapter 3 and variational inference. In addition, this chapter sets out two sorts of generative model used to formulate Active Inference problems. These include the partially observable Markov decision processes used for decision-making and planning and the continuous-time dynamical models that interface with sensory receptors and muscles. Finally, we see how free energy minimization for each of these models manifests as dynamic belief updating.

In chapter 5, we will move from formal treatments to the biological implications of Active Inference. Starting from the premise that "everything that changes in the brain must minimize free energy" (Friston 2009), we will discuss how the specific quantities involved in free energy minimization (e.g., prediction, prediction error, and precision signals) manifest in neuronal dynamics. This aids in mapping the abstract computational principles of Active Inference to specific neural computations that can be executed by physiological substrates. This is important in forming hypotheses under this framework and ensures that these are answerable to measured data. In other words, chapter 5 sets out the process theory associated with Active Inference.

Throughout the first part of the book, we will discuss several characteristic aspects of Active Inference. These highlight the ways in which it is different from alternative frameworks that seek to explain biological regulation and cognition, some of which we preview here:

• Under Active Inference, perception and action are two complementary ways to fulfill the same imperative: minimization of free energy. Perception minimizes free energy (and surprise) by (Bayesian) belief updating or changing your mind, thus making your beliefs compatible with sensory observations. Instead, action minimizes free energy (and surprise) by changing the world to make it more compatible with your beliefs and goals. This unification of cognitive functions marks a fundamental difference between Active Inference and other approaches that treat action and perception in isolation from one another. Learning is yet another way to minimize free energy. However, it is not fundamentally different from perception; it simply operates at a slower timescale. The complementarity between perception and action will be unpacked in chapter 2.

• In addition to driving action selection in the present to change currently available sensory data, the Active Inference framework accommodates planning: the selection of the optimal course of action (or policy) over the future. Optimality here is measured in relation to an expected free energy, which is distinct from the notion of variational free energy considered above in the context of action and perception. Indeed, while computing variational free energy depends on present and past observations, computing expected free energy also requires predicted future observations (hence the term expected). Interestingly, the expected free energy of a policy comprises two parts: the first quantifies the extent to which the policy is expected to resolve uncertainty (exploration), and the second how consistent the predicted outcomes are with an agent's goals (exploitation). In contrast with other frameworks, policy selection in Active Inference automatically balances exploration and exploitation. The relations between variational and expected free energy will be unpacked in chapter 2.

• Under Active Inference, all cognitive operations are conceptualized as inference over generative models, in keeping with the idea that the brain performs probabilistic computations, aka the Bayesian brain hypothesis. Yet the appeal to a specific approximate form of Bayesian inference, that is, a variational scheme motivated by first principles, adds specificity to the process theory. Furthermore, Active Inference extends the inferential approach to domains of cognition that are rarely considered and adds some specificity to the kind of models and inferential processes that may be implemented by biological brains. Under some assumptions, the dynamics that emerge from generative models used in Active Inference closely correspond to widespread models in computational neuroscience, such as predictive coding (Rao and Ballard 1999) and the Helmholtz machine (Dayan et al. 1995). The specifics of the variational scheme will be unpacked in chapter 4.

• Under Active Inference, both perception and learning are active processes, for two reasons. First, the brain is essentially a predictive machine, which constantly predicts incoming stimuli rather than passively waiting for them. This is important because perceptual and learning processes are always contextualized by prior predictions (e.g., expected and unexpected stimuli affect perception and learning in different ways). Second, creatures engaging in Active Inference actively seek out salient sensory observations that resolve their uncertainty (e.g., by orienting their sensors or selecting learning episodes that are informative). The active character of perception and learning stands in contrast with most current theories that treat them as largely passive processes; this will be unpacked in chapter 2.

• Action is quintessentially goal directed and purposive. It starts from a desired outcome or goal (analogous to the concept of a set-point in cybernetics), which is encoded as a prior prediction. Planning proceeds by inferring an action sequence that fulfills this prediction (or, equivalently, reduces any prediction error between the prior prediction and the current state). The goal-directed character of action in Active Inference is in keeping with early cybernetic formulations but is distinct from most current theories that explain behavior in terms of stimulus-response mappings or state-action policies. Stimulus-response or habitual behavior then becomes a special case of a broader family of policies in Active Inference. The goal-directed nature of Active Inference will be unpacked in chapters 2 and 3.

• Various constructs of Active Inference have plausible biological analogues in the brain. This implies that, once one has defined a specific generative model for a problem at hand, one can move from Active Inference as a normative theory to Active Inference as a process theory, which makes specific empirical predictions. For example, perceptual inference and learning correspond to changing synaptic activity and changing synaptic efficacy, respectively. Precision of predictions (in predictive coding) corresponds to the synaptic gain of prediction error units. Precision of policies corresponds to dopaminergic activity. Some of the biological consequences of Active Inference will be unpacked in chapter 5.

1.4.2 Part 2: Active Inference in Practice

While the first part of the book provides readers with the conceptual and formal tools to understand Active Inference, the second part focuses on practical issues. Specifically, we hope to provide readers with the tools to understand existing Active Inference models of cognitive functions (and dysfunctions) and to design novel ones. To this aim, we discuss specific examples of models using Active Inference. Importantly, models of Active Inference can vary along different dimensions (e.g., with discrete or continuous time formulations, flat or hierarchical inference). The second part is structured as follows.

In chapter 6, we introduce a recipe to build Active Inference models. The recipe covers the essential steps to design an effective model, which include the identification of the system of interest, the most appropriate form of the generative model (e.g., to characterize discrete- or continuous-time phenomena), and the specific variables to be included in the model. This chapter therefore offers an introduction to the design principles that underwrite the models discussed in the following chapters.

In chapter 7, we discuss Active Inference models that address problems formulated in discrete time, for example, as hidden Markov models (HMMs) or partially observable Markov decision processes (POMDPs). Our examples include a model of perceptual processing and a model of discrete foraging choices, that is, whether to turn left or right at a decision point to secure a reward. We also introduce topics such as information seeking, learning, and novelty seeking, which can be treated in terms of discrete-time Active Inference.

In chapter 8, we discuss Active Inference models that address problems formulated in continuous time, using stochastic differential equations. These include models of perception (like predictive coding), movement control, and sequential dynamics. Interestingly, it is in the continuous-time formulation that some of the most distinctive predictions of Active Inference appear, such as the idea that movement generation stems from the fulfillment of predictions and that attentional phenomena can be understood in terms of precision control. We also introduce hybrid models of Active Inference that include both discrete- and continuous-time variables. These permit simultaneous assessment of the choice among discrete options (e.g., targets for saccades) and the continuous movements resulting from the choice (e.g., oculomotor movements).

In chapter 9, we illustrate how to use Active Inference models to analyze data from behavioral experiments. We discuss the specific steps that are necessary for model-based data analysis, from the collection of data to the formulation of a model and its inversion to support the analysis of data from single participants or at the group level.

In chapter 10, we discuss the relations between Active Inference and other theories in psychology, neuroscience, AI, and philosophy. We also highlight the most important aspects of Active Inference that distinguish it from the other theories.

In the appendixes, we briefly discuss the mathematical background required to understand the most technical parts of the book, including the notions of Taylor series approximation, variational Laplace, variational calculus, and more. For reference we also present in a concise form the most important equations used in Active Inference.

In sum, the second part of the book illustrates a broad variety of models of biological and cognitive phenomena that can be constructed using Active Inference, along with a methodology to design novel ones. Apart from the interest of the specific models, we hope that our treatment clarifies the value of using a unified, normative framework to address biological and cognitive phenomena from a coherent perspective. In the end, this is the real appeal of normative frameworks: to provide a unified perspective and a guiding principle that reconciles apparently disconnected phenomena, in this case, phenomena like perception, decision-making, attention, learning, and movement control, each of which has its own chapter in any psychology or neuroscience textbook.

The models highlighted in the second part have been selected to illustrate specific points as simply as possible. While we cover several models and domains, from discrete-time decisions to continuous-time perception and movement control, we are clearly disregarding many others that are equally interesting. Many other Active Inference models exist in the literature that cover domains as diverse as biological self-organization and the origins of life (Friston 2013), morphogenesis (Friston, Levin et al. 2015), cognitive robotics (Pio-Lopez et al. 2016; Sancaktar et al. 2020), social dynamics and niche construction (Bruineberg, Rietveld et al. 2018), the dynamics of synaptic networks (Palacios, Isomura et al. 2019), learning in biological networks (Friston and Herreros 2016), and psychopathological conditions, such as post-traumatic stress disorder (Linson et al. 2020) and panic disorder (Maisto, Barca et al. 2021). These models vary along many dimensions: some are more directly related to biology whereas others are less so; some are single-agent models whereas others are multi-agent models; some target adaptive inference whereas others target maladaptive inference (e.g., in patient groups); and so on.

This growing literature exemplifies the increasing popularity of Active Inference and the possibility of using it in a very large variety of domains. The aim of this book is to provide our readers with the ability to understand and use Active Inference in their own research, possibly to explore its unforeseen potentialities.

1.5 Summary

This chapter briefly introduces the Active Inference approach to explaining biological problems from a normative perspective and previews some implications of this perspective that will be unpacked in later chapters. Furthermore, this chapter highlights the division of the book into two parts, which aim to help readers understand Active Inference and use it in their own research, respectively. Over the next few chapters, we will develop the low road and high road perspectives outlined herein, before delving into the structure of generative models and the resulting message passing. Together these comprise Active Inference in principle and provide the preliminaries for Active Inference in practice. We hope that these chapters will persuade readers that Active Inference offers not only a unifying principle under which to understand behavior but also a tractable approach to studying action and perception in autonomous systems.

2 The Low Road to Active Inference

My thinking is first and last and always for the sake of my doing.

-William James

2.1 Introduction

This chapter introduces Active Inference by starting from the Helmholtzian, or perhaps Kantian, view of "perception as unconscious inference" (Helmholtz 1866) and related ideas that have emerged more recently under the Bayesian brain hypothesis. It explains how Active Inference subsumes and extends these ideas by treating not just perception but also action, planning, and learning as problems of (Bayesian) inference and by deriving a principled (variational) approximation to such otherwise intractable problems.

2.2 Perception as Inference

There is a long tradition of seeing the brain as a "predictive machine," or a statistical organ that infers and predicts external states of the world. This idea dates back to the notion of "perception as unconscious inference" (Helmholtz 1866). More recently, this has been reformulated as the "Bayesian brain" hypothesis (Doya 2007). From this perspective, perception is not a purely bottom-up transduction of sensory states (e.g., from the retina) into internal representations of what is out there (e.g., as patterns of neuronal activity). Rather, it is an inferential process that combines (top-down) prior information about the most likely causes of sensations with (bottom-up) sensory stimuli. Inferential processes operate on probabilistic representations of states of the world and follow Bayes' rule, which prescribes the (optimal) belief update in the light of sensory evidence. Perception is not a passive outside-in process, in which information is extracted from impressions made on our sensory epithelia from "out there." It is a constructive inside-out process, in which sensations are used to confirm or disconfirm hypotheses about how they were generated (MacKay 1956; Gregory 1980; Yuille and Kersten 2006; Neisser 2014; A. Clark 2015).

In turn, performing Bayesian inference requires a generative model, sometimes referred to as a forward model. A generative model is a construct from statistical theory that generates predictions about observations. It may be formulated as the joint probability P(y, x) of observations y and the world's hidden states x that generate those observations. The latter are referred to as hidden or latent states, as they cannot be observed directly. This joint probability can be decomposed into two parts: P(y, x) = P(y|x)P(x). The first part is a prior P(x), which denotes the organism's knowledge about the hidden states of the world prior to seeing sensory data. The second is the likelihood P(y|x), which denotes the organism's knowledge of how observations are generated from states. Bayes' rule tells us how to combine these two elements, essentially updating the prior probability P(x) into a posterior probability of hidden states P(x|y) after receiving observations. For readers who need a brief refresher on basic probability theory, box 2.1 provides a summary.

Bayesian inference is a broad topic that arises in disciplines like statistics, machine learning, and computational neuroscience. A full treatment of the associated topics is beyond the scope of this book, but there are excellent resources available for those who wish to understand it in depth (Murphy 2012). However, all of this is based on one simple rule. To illustrate this rule, we consider an example of Bayesian perceptual inference (figure 2.1). Imagine a person who has a strong belief that she is confronted with an apple. This belief corresponds to a prior probability, or prior for short. This prior comprises the probability attributed to the apple hypothesis and the probability assigned to alternative hypotheses. In this example, our alternative hypothesis is that it is not an apple but a frog. Numerically, the prior probability distribution assigns 0.9 to apple and 0.1 to frog. Note that, as we have assumed that there are only two plausible (mutually exclusive) hypotheses, these probabilities must sum to one. The person is also equipped with a likelihood model, which assigns a high probability to the fact that frogs jump, whereas apples do not. This likelihood specifies the (probabilistic) mapping from the two hidden states (frog or apple) to the two observations (jumps or does not jump). Together, the prior and the likelihood form the person's generative model.

Box 2.1 The sum and product rules of probability

Probabilistic reasoning is underwritten by two key rules: the sum and product rules of probability, which are as follows (respectively):

Σ_x P(x) = 1
P(x, y) = P(y|x)P(x)

The sum rule says that the probability of all possible events (x) must sum (or integrate) to one. The product rule says that the joint probability of two random variables (x and y) may be decomposed into the product of the probability of one variable (P(x)) and the conditional probability of the second variable given the first (P(y|x)). A conditional probability is the probability of one variable (here, y) if we know the value that the other variable (here, x) takes.

We can develop two important results from these simple rules. The first is the operation of marginalization. The second is Bayes' rule. Marginalization allows us to obtain a distribution of just one of the two variables from a joint distribution:

P(y) = Σ_x P(x, y)

The probability of y is referred to as a marginal probability, and we refer to this operation as marginalizing out x. Bayes' rule may be obtained directly from the product rule:

P(x|y) = P(y|x)P(x) / P(y)

This lets us translate between a prior and conditional distribution (likelihood) and the associated marginal and the other conditional distribution (posterior). Put simply, Bayes' rule just says that the probability of two things is the probability of the first, given the second, times the probability of the second, which is the same as the probability of the second, given the first, times the probability of the first.
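As a minimal numerical sketch of these rules (our illustration; it assumes NumPy and an arbitrary 2 × 2 joint distribution), one can verify the sum rule and recover marginalization and Bayes' rule from the product rule:

```python
import numpy as np

# A hypothetical 2 x 2 joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Sum rule: probabilities over all possible events sum to one.
assert np.isclose(P_xy.sum(), 1.0)

# Marginalization: sum the joint over x to obtain the marginal P(y).
P_y = P_xy.sum(axis=0)

# Product rule, rearranged: the conditional P(x|y) = P(x, y) / P(y).
P_x_given_y = P_xy / P_y

# Bayes' rule: the same posterior, rebuilt from the prior and the likelihood.
P_x = P_xy.sum(axis=1)               # prior P(x), marginalizing out y
P_y_given_x = P_xy / P_x[:, None]    # likelihood P(y|x)
posterior = P_y_given_x * P_x[:, None] / P_y
assert np.allclose(posterior, P_x_given_y)
```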

Now imagine that the person observes that her apple-frog jumps. Bayes' rule tells us how to form a posterior belief from the prior, taking into account the likelihood of jumping. This rule is expressed as follows:

P(x|y) = P(y|x)P(x) / P(y)    (2.1)

Figure 2.1 A simple example of Bayesian inference. Upper left: The organism's prior belief P(x) about the object it will see, before having made any observations; i.e., a categorical distribution over two possibilities, apple (with probability 0.9) and frog (with probability 0.1). Upper right: The organism's posterior belief P(x|y) after observing that the object jumps. Posterior beliefs can be computed using Bayes' rule under a likelihood function P(y|x). This is shown below the prior and posterior and specifies that if the object is an apple, there is a very small probability (0.01) that it will jump, while if it is a frog, the probability that it will jump is much higher (0.81). (The probability bars in this figure are not exactly to scale.) In this specific case, the update from prior to posterior is large.

Under the likelihood model in figure 2.1, the posterior probability assigned to the frog is 0.9, and the probability assigned to the apple is 0.1. As highlighted in box 2.1, the denominator of equation 2.1 may be computed by marginalizing the numerator. Using our apple-frog example, we take the opportunity to unpack two different notions of surprise, both of which are important in Active Inference. The first, which we refer to simply as surprise, is the negative log evidence, where evidence is the marginal probability of observations. In our example, this is the negative log probability of observing anything jumping under the generative model. Surprise is a very important quantity from a Bayesian perspective. It is a measure of how poorly a model fits the data it tries to explain. To put this intuitively, we can work out the probability of the observed (jumping) behavior under our model. Remember that this assigns a very high prior probability to apples and a low prior probability to frogs. Thus, our marginal probability of jumping is as follows:

P(y = jumps) = Σ_x P(y = jumps|x)P(x) = 0.81 × 0.1 + 0.01 × 0.9 = 0.09    (2.2)
This means that, under this model, we would only expect to observe jumping behavior about 9 times out of 100 observations. As such, we should be surprised to observe this if we subscribed to the model in figure 2.1. We can quantify this in terms of surprise: S(y = jumps) = -ln P(y = jumps) = -ln(0.09) ≈ 2.4 nats (nats are units of information based on the natural logarithm). The bigger this number, the worse the model is as an apt explanation for the observations at hand. This lets us compare models in relation to data. For example, consider an alternative model, where we have a prior belief that frogs are seen 100 percent of the time. Following the same steps as in equation 2.2, we calculate a surprise of about 0.2 nats. This is a better model of these data, as the observation is much less surprising. The procedure of scoring models on the basis of their evidence (or surprise) is often referred to as Bayesian model comparison.

For more complicated models, the form of the surprise may not be so simple. Table 2.1 provides the form of the surprise (omitting constants) for a range of probability distributions, in addition to the categorical probability in our example. Crucially, this lets us talk about surprise for probability distributions whose support (the set of values to which they assign nonzero probability) differs from the simple example used here. This is important because the way in which sensory data are generated by the world varies with the sort of data. We could be surprised by encountering the face of someone we did not expect to see (categorical distribution), or we could be surprised by it being colder outside than we anticipated (continuous distribution). Table 2.1 may be seen as a portfolio of the probability distributions at our disposal when we come to construct generative models in subsequent chapters. More generally, it makes the point that surprise is a concept that can be evaluated for any given family of probability distributions.
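The calculations above are simple enough to reproduce directly. As a minimal sketch (assuming NumPy; the numbers are those of figure 2.1), the following computes the evidence, the surprise, the posterior, and the comparison with the all-frogs model:

```python
import numpy as np

# Apple-frog example from figure 2.1.
prior = np.array([0.9, 0.1])             # P(x): [apple, frog]
lik_jumps = np.array([0.01, 0.81])       # P(y=jumps | x)

# Marginal likelihood (model evidence) of the jumping observation (equation 2.2).
evidence = lik_jumps @ prior             # 0.01*0.9 + 0.81*0.1 = 0.09

# Surprise is negative log evidence, measured in nats.
surprise = -np.log(evidence)             # about 2.4 nats

# Posterior via Bayes' rule (equation 2.1): P(x | y=jumps).
posterior = lik_jumps * prior / evidence # [0.1, 0.9]

# Bayesian model comparison: an alternative model that always expects frogs.
alt_prior = np.array([0.0, 1.0])
alt_surprise = -np.log(lik_jumps @ alt_prior)  # -ln(0.81), about 0.2 nats

print(surprise, posterior, alt_surprise)
```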

Table 2.1 Surprise associated with some common probability distributions (omitting constants). Notes: 1. Special cases include the categorical (K > 2, N = 1), binomial (K = 2, N > 1), and Bernoulli (K = 2, N = 1) distributions. 2. A special case is the beta distribution (K = 2).

The second notion of surprise is (slightly confusingly) referred to as Bayesian surprise. This is a measure of how much we have to update our beliefs following an observation. In other words, Bayesian surprise quantifies the difference between a prior and a posterior probability. This raises the question of how we quantify the dissimilarity of two probability distributions. One answer, from information theory, is to use a Kullback-Leibler (KL) divergence. This is defined as the average difference between two log probabilities:

D_KL[P(x|y) || P(x)] = E_P(x|y)[ln P(x|y) - ln P(x)]    (2.3)

The E symbol here indicates an average (or expectation), as outlined in box 2.2. Using the KL divergence, we can quantify the Bayesian surprise of our example:

D_KL[P(x|y) || P(x)] = 0.9 ln(0.9/0.1) + 0.1 ln(0.1/0.9) ≈ 1.76 nats    (2.4)

This scores the amount of belief updating, as opposed to simply how unlikely the observation was. To highlight the distinction between surprise and Bayesian surprise, consider what happens if we commit to a prior belief that we will always see apples. The Bayesian surprise will be zero, as the prior is so confident that we do not update it at all following our observations. However, the surprise is very large (4.6 nats), as it is highly unlikely that an apple will jump.
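The same example can be extended in code. A minimal sketch (again assuming NumPy, with the usual convention that 0 ln 0 = 0) computes the KL divergence of equation 2.4 and contrasts it with the dogmatic apple prior, whose posterior equals its prior:

```python
import numpy as np

def kl(q, p):
    """KL divergence D_KL[q || p], with the convention 0 ln 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

prior = np.array([0.9, 0.1])        # P(x): [apple, frog]
posterior = np.array([0.1, 0.9])    # P(x|y) after seeing the object jump

# Bayesian surprise: how far beliefs moved from prior to posterior (equation 2.4).
print(kl(posterior, prior))         # about 1.76 nats

# A dogmatic prior (always apple) yields a posterior equal to the prior,
# so the Bayesian surprise is zero ...
dogmatic = np.array([1.0, 0.0])
print(kl(dogmatic, dogmatic))       # 0.0
# ... even though the observation itself is very surprising under that model.
print(-np.log(0.01))                # -ln P(y=jumps), about 4.6 nats
```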

Box 2.2 Expectations

The expectation (or expected value) of a function f of a random variable x, under a distribution P(x), is its probability-weighted average:

E_P(x)[f(x)] = Σ_x P(x) f(x)

with the sum replaced by an integral for continuous variables. Intuitively, an expectation is what we would obtain by sampling x from P(x) many times, evaluating f(x) for each sample, and averaging the results.

Note that while we illustrated Bayesian inference on the basis of a very simple generative model, it applies to generative models of any complexity. In chapter 4 we will highlight two forms of generative model that underwrite most applications in Active Inference.

2.3 Biological Inference and Optimality

There are two important points that connect the above inferential scheme to biological and psychological theories of perception. First, the inferential procedure discussed requires the interplay of top-down processes that encode predictions (from the prior) and bottom-up processes that encode sensory observations (as mediated by the likelihood). This interplay of top-down and bottom-up processes distinguishes the inferential view from alternative approaches that only consider bottom-up processes. Furthermore, it is central in modern biological treatments of perception, such as predictive coding (discussed in chapter 4), which is a specific algorithmic (or process-level) implementation of the more general (Bayesian) inference scheme discussed here.

Second, Bayesian inference is optimal. Optimality is defined in relation to a cost function that is optimized (i.e., minimized), which, for Bayesian inference, is known as variational free energy; this is closely related to surprise, and we return to it in section 2.6. By explicitly considering the full distribution over hidden states, Bayesian inference naturally handles uncertainty, hence avoiding the limitations of alternative approaches that only consider point estimates of hidden states (e.g., the mean value of x). One such alternative would be maximum likelihood estimation, which simply selects the hidden state most likely to have generated the data at hand. The problem with this is that such estimates ignore both the prior plausibility of the hidden state and the uncertainty surrounding the estimation. Bayesian inference does not suffer these limitations. However, despite the use of surprise in objectively assessing whether the model is fit for purpose, it is important to appreciate that inference itself is subjective. The results of inference are not necessarily accurate in any objective sense (i.e., the organism's belief may not actually correspond to reality) for at least two important reasons. First, biological creatures operate on the basis of limited computational and energetic resources, which render exact Bayesian inference intractable. This requires approximations that preclude guarantees of exact Bayesian optimality. These approximations include the notion of a variational posterior, based on something called a mean field approximation, which is central to chapter 4.

The second reason optimality may be thought of as subjective is that organisms operate on the basis of their own, subjective generative model of how their observations are generated, which may or may not correspond to the real generative process that generates those observations. This is not to say that the generative model should correspond to the generative process. In fact, there may be models that afford better (e.g., simpler) explanations of the data at hand than the processes that actually generated them, as quantified by their relative surprise. A nice example of this is illusions, in which someone finds a simpler explanation for their visual input than the way the visual stimuli have been carefully engineered by a mischievous psychophysicist.

The generative model itself may be optimized as new experience is acquired. This may or may not converge to the generative process. Figure 2.2 illustrates this point and the difference between the true environmental contingencies, or the generative process, which is inaccessible to the organism, and the organism's generative model of the world. In this particular example, the generative process is in a true state x* that is inaccessible to the organism. However, the organism and world are reciprocally coupled, and x* generates an observation y, which the organism senses. The organism can use this observation y and Bayes' rule to infer the (posterior probability of) some explanatory variable or hidden state x in the generative model. In the figure, we refer to both x* and x as hidden states, emphasizing that neither is observable. However, they are subtly different: the former is part of the generative process and inaccessible to the organism, whereas the latter is part of the organism's generative model. Furthermore, x* and x do not necessarily live in the same space. It might be that the hidden states in the external world take on values that lie outside the space of explanations available to the brain. Conversely, it might be that the brain's explanations include variables that do not exist in the outside world. For example, the former could be 5-dimensional and the latter 2-dimensional, or one could be continuous and the other categorical.

Figure 2.2 Generative process and generative model. Both represent ways in which sensory data (y) could be generated given hidden states and are represented through arrows from x to y to indicate causality. The difference is that the process is the true causal structure by which data are generated, while the model is a construct used to draw inferences about the causes of data (i.e., it uses observations to derive inferred states). The hidden states of the generative model and the generative process are not the same: the organism's model includes a range of hypotheses (x) about the hidden state, which do not necessarily include the true value of the hidden state (x*).
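To illustrate the independence of the two state spaces, here is a toy sketch (our construction; all the numbers and the five-state process are hypothetical) in which the generative process has five hidden states while the organism's generative model entertains only two hypotheses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative process (the world): a hypothetical five-state system, inaccessible
# to the organism, that emits a binary observation y.
p_y1_given_xstar = np.array([0.05, 0.20, 0.50, 0.80, 0.95])
x_star = rng.integers(5)                        # true hidden state x*
y = int(rng.random() < p_y1_given_xstar[x_star])

# Generative model (the organism): just two hypotheses, a space that need not
# match the five states of the process.
prior = np.array([0.7, 0.3])                    # P(x) over the two hypotheses
lik_y1 = np.array([0.1, 0.9])                   # P(y=1 | x) under each hypothesis

lik = lik_y1 if y == 1 else 1.0 - lik_y1
posterior = lik * prior / (lik @ prior)
print(f"x* = {x_star}, y = {y}, P(x|y) = {posterior}")
```

Inference still goes through: the organism explains its observation within its own space of hypotheses, whether or not that space resembles the world's.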

The distinction between the generative model and the generative process is important to contextualize psychological claims about the optimality of inference. To the extent that such claims are valid, optimality is, on a Bayesian view, always contingent on the organism's resources. By resources, we mean its specific generative model and its bounded computational and mnemonic resources.

2.4 Action as Inference

The discussion to this point is common to all Bayesian brain theories. However, we now introduce the simple but fundamental advance offered by Active Inference. This starts from the same inferential perspective discussed above but extends it to consider action as inference. This idea stems from the fact that Bayesian inference minimizes surprise (or, equivalently, maximizes Bayesian model evidence). So far, we have considered what happens when we compute surprise by performing inference, and select among models on the basis of their capacity to minimize surprise. However, surprise does not depend only on the model. It also depends on the data. By acting on the world to change the way in which data are generated, we can ensure a model is fit for purpose by choosing those data that are least surprising under our model.

Equipped with a mechanism to produce actions, an organism can engage in reciprocal exchanges with its environment; see figure 2.2. In animals, this mechanism takes the form of a motor reflex loop. Essentially, for each action-perception cycle, the environment sends an observation to the organism. The organism uses (an approximation to) Bayesian inference to infer its most likely hidden states. It then generates an action and sends it to the environment in an attempt to make the environment less surprising. The environment executes the action, generates a new observation, and sends it to the organism. Then, a new cycle starts. The sequential description here is written for didactic purposes; it is important to realize that these are not really discrete steps but are continuous dynamical processes.
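Schematically, this cycle can be written as a loop. The sketch below is purely illustrative (the environment and agent objects and their methods are placeholders, not a library API), and, as noted above, a real organism runs these as continuous dynamical processes rather than discrete steps:

```python
# A schematic action-perception cycle; all names here are hypothetical placeholders.

def active_inference_loop(environment, agent, n_cycles=10):
    observation = environment.reset()
    for _ in range(n_cycles):
        # Perception: (approximate) Bayesian inference on hidden states.
        beliefs = agent.infer_states(observation)
        # Action: emit the action expected to make observations less surprising.
        action = agent.select_action(beliefs)
        # The environment executes the action and returns a new observation.
        observation = environment.step(action)
```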

Active Inference goes beyond the recognition that perception and action have the same (inferential) nature. It also assumes that both perception and action cooperate to realize a single objective (or optimize just one function) rather than having two distinct objectives, as is more commonly assumed. In the Active Inference literature, this common objective has been described in various (informal and formal) ways, including the minimization of surprise, entropy, uncertainty, prediction error, or (variational) free energy. These terms are related to one another, but sometimes their relations are not immediately clear, causing some confusion. Furthermore, these terms are used in different contexts; for example, prediction error minimization is used in biological contexts where the objective is explaining brain signals, while variational free energy minimization is used in machine learning.

In the next two sections, we will clarify that the single quantity that Active Inference agents minimize through perception and action is variational free energy. However, under some conditions, one can reduce variational free energy to other notions, such as the discrepancy between the generative model and the world, or the difference between what one expects and what one observes (i.e., a prediction error). We will introduce variational free energy formally in section 2.6. For simplicity, section 2.5 focuses on the ways in which perception and action minimize the discrepancy between the generative model and the world.

2.5 Minimizing the Discrepancy between Model and World

Having established perception and action in terms of Bayesian inference, we now turn to the question of what the objective of inference is. In other words, what is being optimized by inference? In cognitive science, it is common to assume that different cognitive functions like perception and action optimize different objectives. For example, we could assume perception maximizes the accuracy of reconstruction while action selection maximizes utility. Instead, a fundamental insight of Active Inference is that both perception and action serve the very same objective. As a first approximation, this common objective of perception and action can be formulated as a minimization of the discrepancy between the model and the world. Sometimes this is operationalized in terms of prediction error.

To understand how perception and action reduce the discrepancy between the model and the world, consider again the example of a person who expects to see an apple (figure 2.3). She generates a top-down visual prediction (e.g., about seeing something red and not jumping). This visual prediction is compared with a sensation (e.g., something jumping)—and this comparison results in a discrepancy.

Figure 2.3 Both perception and action minimize the discrepancy between model and world.

The person can resolve this discrepancy in two ways. First, she could change her mind about what she is seeing (i.e., conclude that it is a frog) to fit the world, hence resolving the discrepancy. This corresponds to perception. Second, she could foveate the nearest apple tree and see something that looks very much like an apple. This also resolves the initial discrepancy, but in a different way: it entails changing the world (including her direction of gaze) and hence subsequent sensations to fit what is in her mind, rather than changing her mind to fit the world. This is the other direction of fit. This is action.

While changing the direction of one's gaze seems less compelling than changing one's mind in the world of apples and frogs, let us consider another case: a person who expects his body temperature to be in a certain range but senses a high temperature via central thermoreceptors. This is surprising and presents a significant discrepancy to resolve. As in the former example, he has two ways to minimize this discrepancy, corresponding to perception (changing his mind) and action (changing the world), respectively. In this case, simply changing one's mind does not seem very adaptive, but acting to lower the body temperature (e.g., by opening the window) is.

This speaks to the fact that in Active Inference, the notion of marginal probabilities or surprise (e.g., about body temperature) has a meaning that goes beyond standard Bayesian treatments to absorb notions like homeostatic and allostatic set-points. Technically, Active Inference agents come equipped with models that assign high marginal probabilities to the states they prefer to visit or the observations they prefer to obtain. For a fish, this means a high marginal likelihood for being in water. This implies that organisms implicitly expect the observations they sample to be within their comfort zone (e.g., physiological bounds).

In sum, we have discussed how, at any point in time, we can minimize the discrepancy between our model and our world through perception and action. Whether we adjust our beliefs or our data depends on the confidence with which we hold those beliefs. In our example of the apple, the belief is held with sufficient uncertainty that it is the belief that gets updated rather than acted on. In contrast, in the temperature example, we are considerably more confident about our core temperature because it underwrites our existence. This confidence means we update our world to comply with our beliefs. Yet, in Active Inference, perception and action act more cooperatively than suggested by this treatment. To understand why this is the case, the next section moves from the restricted notion of discrepancy (or prediction error) to the more general notion of variational free energy—which is the quantity that Active Inference actually minimizes and which subsumes prediction error as a special case.

2.6 Minimizing Variational Free Energy

So far, we have discussed perception and action within a Bayesian scheme that aims to minimize surprise. Yet exact Bayesian inference in support of perception and action is computationally intractable in most cases, because two quantities—the model evidence P(y) and the posterior probability P(x|y)—cannot be computed, for two possible reasons. The first is that for complex models, there may be many types of hidden states that all need marginalizing out, making the problem computationally intractable. The second is that the marginalization operation might require analytically intractable integrals. Active Inference therefore appeals to a variational approximation of Bayesian inference that is tractable.
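To see where this intractability arises, consider the marginalization implied by the model evidence; the following is a standard illustration (our own, sketched for a model with n discrete hidden state variables):

\[
P(y) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} P(y \mid x_1, \ldots, x_n)\, P(x_1, \ldots, x_n)
\]

The number of summands grows exponentially with n; for continuous hidden states, the sums become integrals that rarely admit closed-form solutions.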

The formalism of variational inference will be unpacked in chapter 4. Here, it suffices to say that performing variational Bayesian inference implies substituting the two intractable quantities—posterior probability and (log) model evidence—with two quantities that approximate them but can be computed efficiently—namely, an approximate posterior Q and a variational free energy F, respectively. The approximate posterior is sometimes called a variational or recognition distribution. Negative variational free energy is also known as an evidence lower bound (ELBO), especially in machine learning.

Most importantly, the problem of Bayesian inference now becomes a problem of optimization: the minimization of variational free energy F. Variational free energy is a quantity with roots in statistical physics that plays a fundamental role in Active Inference. In equation 2.5, it is denoted as F[Q,y], as it is a functional (function of a function) of the approximate posterior Q and a function of data y:

\begin{aligned}
F[Q, y] &= -\mathbb{E}_{Q(x)}[\ln P(y, x)] - H[Q(x)] && \text{(energy minus entropy)} \\
&= D_{KL}[Q(x) \parallel P(x)] - \mathbb{E}_{Q(x)}[\ln P(y \mid x)] && \text{(complexity minus accuracy)} \\
&= D_{KL}[Q(x) \parallel P(x \mid y)] - \ln P(y) && \text{(divergence minus evidence)}
\end{aligned}
\tag{2.5}

Variational free energy may seem prima facie an abstract concept, but its nature and the role it plays in Active Inference become apparent when decomposed into quantities that are more intuitive and familiar in cognitive science. Each of these perspectives on variational free energy offers useful intuitions about what free energy minimization means. We briefly sketch these intuitions here, as they will become important when we discuss examples in the second part of the book.

The first line of equation 2.5 shows that minimizing with respect to Q requires consistency with the generative model (energy) while also maintaining a high posterior entropy. The latter means that, in the absence of data or precise prior beliefs (which only influence the energy term), we should adopt maximally uncertain beliefs about the hidden states of the world, in accord with Jaynes's maximum entropy principle (Jaynes 1957). Put simply, we should be uncertain (adopt a high-entropy belief) when we have no information. The term energy inherits from statistical physics. Specifically, under a Boltzmann distribution, the average log probability of a system adopting some configuration is inversely proportional to the energy associated with that configuration—that is, the energy required to move the system into this configuration from a baseline configuration.

The second line emphasizes the interpretation of free energy minimization as finding the best explanation for sensory data, which must be the simplest (minimally complex) explanation that is able to accurately account for the data (cf. Occam's razor). The complexity-accuracy trade-off recurs across several domains, normally in the context of model comparison for data analysis. In statistics, other approximations to model evidence are sometimes used, such as the Bayesian information criterion or Akaike information criterion. The complexity-accuracy trade-off will become important when we describe how to use free energy for model comparison during model-based data analysis—and in the context of structure learning and model reduction. Inferring explanations that have minimal complexity is also important from a cognitive perspective. This is because one can assume that updating what one knows (the prior) to accommodate the data entails a cognitive cost (Ortega and Braun 2013; Zénon et al. 2019); hence, an explanation that diverges minimally from the prior is preferable.

On this view, the complexity cost is just Bayesian surprise. In other words, the degree to which “I change my mind” is quantified by the divergence between the prior and the posterior. This means every accurate explanation for my sensations incurs a complexity cost, and this cost scores the degree of Bayesian belief updating. Variational free energy, then, scores the difference between accuracy and complexity.
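For readers who prefer a computational illustration, the following minimal sketch (our own example, not from the text; the model and numbers are hypothetical) computes the three forms of equation 2.5 for a two-state model and confirms that they agree and upper-bound surprise:

```python
import numpy as np

# A minimal generative model: x in {frog, apple}, y in {jumps, still}
prior = np.array([0.5, 0.5])             # P(x)
likelihood = np.array([[0.9, 0.1],       # P(y | x = frog)
                       [0.1, 0.9]])      # P(y | x = apple)
y = 0                                    # observe "jumps"

joint = prior * likelihood[:, y]         # P(y, x)
evidence = joint.sum()                   # P(y)
posterior = joint / evidence             # P(x | y)

Q = np.array([0.7, 0.3])                 # an arbitrary approximate posterior

energy = -(Q * np.log(joint)).sum()              # -E_Q[ln P(y, x)]
entropy = -(Q * np.log(Q)).sum()                 # H[Q]
complexity = (Q * np.log(Q / prior)).sum()       # KL[Q || P(x)]
accuracy = (Q * np.log(likelihood[:, y])).sum()  # E_Q[ln P(y | x)]
divergence = (Q * np.log(Q / posterior)).sum()   # KL[Q || P(x | y)]

F1 = energy - entropy                    # energy minus entropy
F2 = complexity - accuracy               # complexity minus accuracy
F3 = divergence - np.log(evidence)       # divergence minus evidence

assert np.allclose([F1, F2], F3)         # the three decompositions agree
assert F3 >= -np.log(evidence)           # free energy upper-bounds surprise
```

Setting Q equal to the exact posterior drives the divergence term to zero, at which point the free energy equals the surprise—anticipating the argument of the following paragraphs.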

The final line expresses the free energy as a bound on negative log evidence (see figure 2.4). As the left part of the figure illustrates, the free energy is an upper bound on negative log evidence, where the bound is the divergence between Q and the posterior probability that would have been obtained were it possible to perform exact (as opposed to variational) inference. The right part of the figure shows that as the divergence decreases, the

Figure 2.4 Variational free energy as an upper bound on negative log evidence.

free energy approaches the negative log evidence (surprise)—and becomes equal to surprise, if the approximate posterior Q matches the exact posterior P(x|y). This offers a formal motivation for perceptual inference as one way to lower free energy by optimizing our approximate posterior Q as much as possible.

The final line of equation 2.5 shows that perceptual inference is not the only way to minimize free energy. We could also change the log evidence term through acting to change sensory data. This decomposition is interesting from a cognitive perspective, since minimizing divergence and maximizing evidence map to the two complementary sub-objectives of perception and action, respectively; see figure 2.5. Note that the above expressions all become ways of characterizing the negative log evidence if we replace Q with P(x|y), generalizing to the case of exact inference.

In sum, Active Inference amounts to minimizing variational free energy by perception and action. This minimization permits an organism to fit its generative model to the observations it samples. This fit is a measure of both perceptual adequacy (as expressed by the divergence term) and active control over external states—in the sense that it permits the organism to maintain itself in a suitable set of preferred states, as defined by the generative model. Another way of phrasing this is to appeal to the divergence versus

Figure 2.5 Complementary roles of perception and action in the minimization of variational free energy.

evidence decomposition of free energy. Equating the negative log evidence with surprise, and noting that the smallest possible divergence is zero, we see that free energy is an upper bound on surprise. This means it can only be greater than or equal to surprise. When the organism minimizes its divergence (through perception), then free energy becomes an approximation to surprise. When an organism additionally changes the observations it gathers (by acting) to render them more similar to prior predictions, it minimizes surprise.

Variational free energy has a retrospective aspect, as it is a function of past and present, but not future, observations. Although it facilitates inferences about the future based on past data, it does not directly facilitate prospective forms of inference based on anticipated future data. This is important in planning and decision-making. Here, we infer the best actions or action sequences (policies) on the basis of the future observations they are expected to bring about. Doing this requires that we supplement our generative models with the notion of expected free energy.

2.7 Expected Free Energy and Planning as Inference

Expected free energy extends Active Inference to include a quintessentially prospective form of cognition: planning. Planning a sequence of actions, such as the series of moves required to escape from a maze, requires considering future observations that one expects to gather. For example, the consequences of possible courses of action include seeing a dead end after turning right or seeing the exit after a sequence of three left turns. Each possible sequence of actions is termed a policy. This highlights an important distinction made in Active Inference between an action and a policy. The former is something that directly influences the outside world, while the latter is a hypothesis about a way of behaving. The implication is that Active Inference treats planning and decision-making as a process of inferring what to do. This brings planning firmly into the realm of Bayesian inference and means we must specify priors and likelihoods as before (section 2.1). However, in place of frogs and apples, the alternatives are behavioral policies (Is it more probable that I look toward the pond or the tree?). In this section, we first briefly deal with the likelihood—that is, the consequences of pursuing a policy—and then turn to the prior. This is where expected free energy comes in.

Policy-dependent outcomes are not immediately available (they are in the future), but they can be predicted by chaining together two components of the generative model. The first is our beliefs about how hidden states change as a function of policies. We will get into the details of this in chapter 4. For now, we use the notation X to denote a sequence or trajectory of hidden states over time, and we condition trajectories on the policies (π) a creature pursues. This means the dynamical part of our model is given by P(X|π). Drawing from our earlier frog-apple example, the policy may be the decision to go to a pond or to an orchard, which changes the probability of encountering frogs versus apples.

The second component of the model is the usual likelihood distribution. This describes which observations to expect in every possible state (e.g., jumping or not, conditioned on frog or apple). By combining these two components, an organism can engage its generative model vicariously to run "what if" or counterfactual simulations of the consequences of its possible actions or policies—for example, "What would happen if I go to the pond?" Marginalizing over states, this gives us the marginal likelihood or evidence for a policy, P(Y|π), or a free energy approximation to this quantity. In other words, knowing how policies influence state transitions lets us compute the likelihood of a sequence of observations under that policy. As we saw in equation 2.1, we need to combine this likelihood with a prior probability to calculate the posterior probability of pursuing a policy.
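To make this chaining concrete, here is a minimal sketch (our own illustration; the two-state model, the policy labels, and all numbers are hypothetical) of predicting outcomes under a policy by composing policy-conditioned transitions with the likelihood:

```python
import numpy as np

# Hypothetical two-state model: x in {frog, apple}, y in {jumps, still}
A = np.array([[0.9, 0.1],     # P(y | x): rows are states, columns are outcomes
              [0.1, 0.9]])
# Policy-conditioned transitions P(x' | x, pi) for two hypothetical policies
B = {"pond":    np.array([[0.8, 0.2],
                          [0.6, 0.4]]),
     "orchard": np.array([[0.2, 0.8],
                          [0.1, 0.9]])}

def predict_outcomes(policy, x0, horizon=3):
    """Chain transitions and likelihood to get predicted outcomes Q(y_t | pi)."""
    x = x0
    predictions = []
    for _ in range(horizon):
        x = x @ B[policy]           # propagate the state belief one step
        predictions.append(x @ A)   # marginalize over states to predict outcomes
    return predictions

x0 = np.array([0.5, 0.5])           # initial belief over hidden states
print(predict_outcomes("pond", x0)) # "what if I go to the pond?"
```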

Active Inference decomposes this planning problem into two successive operations. The first is to compute a score for each policy. The second is to form posterior beliefs about which to pursue. The former defines the prior belief about the policies to pursue, where the best policies have high probability and the worst policies have low probability. Under Active Inference, the goodness of a policy is scored by the associated negative expected free energy—just as the goodness of a model fit is scored by the negative free energy of that model. The expected free energy (G) of a policy is different from the variational free energy (F), since calculating the former requires consideration of future, policy-dependent observations. In contrast, the latter only considers present and past observations. Calculating expected free energy therefore engages the generative model to predict future observations that would stem from each policy—if it were to be executed—up to some planning horizon. Furthermore, because a policy unfolds over multiple time steps, the final measure of expected free energy for each policy has to integrate over all future time steps of that policy.

The expected free energy of each policy can be converted into a quality score (by taking its negative) and is directly available as a prior to agents engaging in Active Inference. This is because—consistent with the notion of potential energy in physics—expected free energy is expressed in the space of log probabilities. Converting it into a belief (or probability distribution) over policies is then a matter of exponentiating (to undo the log) and normalizing (to ensure consistency with the sum rule in box 2.1). Policies that are associated with a lower expected free energy are assigned higher probability and become the policies that the organism expects to pursue.
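In computational terms, this conversion is just a softmax over negative expected free energies. A minimal sketch (the G values are hypothetical):

```python
import numpy as np

def softmax(v):
    v = v - v.max()                 # subtract the max for numerical stability
    e = np.exp(v)
    return e / e.sum()

G = np.array([4.2, 1.3, 2.7])       # expected free energy of three policies
prior_over_policies = softmax(-G)   # exponentiate the negative, then normalize
print(prior_over_policies)          # the policy with lowest G is most probable
```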

Ultimately, inferring that we are pursuing a particular policy has consequences for the sensory data we predict. For example, a policy that includes flexing my elbow entails predictions about the proprioceptive input from the biceps and triceps muscles. This provides the link between planning and action, as the predictions associated with a plan translate into action that resolves discrepancies with measured proprioceptive data (see section 2.3).

2.8 What Is Expected Free Energy?

So far, we have assumed that during planning, the organism scores its policies according to their expected free energy. However, we have sidestepped what expected free energy actually is. Like variational free energy, the expected free energy can be decomposed in several, mathematically equivalent ways. Each of these provides an alternative perspective on this quantity. Equation 2.6 presents three such decompositions:

\begin{aligned}
G(\pi) &= -\underbrace{\mathbb{E}_{Q(X,Y \mid \pi)}\big[D_{KL}[Q(X \mid Y, \pi) \parallel Q(X \mid \pi)]\big]}_{\text{information gain (epistemic value)}} - \underbrace{\mathbb{E}_{Q(Y \mid \pi)}[\ln P(Y \mid C)]}_{\text{pragmatic value}} \\
&= \underbrace{D_{KL}[Q(Y \mid \pi) \parallel P(Y \mid C)]}_{\text{risk (over outcomes)}} + \underbrace{\mathbb{E}_{Q(X \mid \pi)}\big[H[P(Y \mid X)]\big]}_{\text{ambiguity}} \\
&= \underbrace{D_{KL}[Q(X \mid \pi) \parallel P(X \mid C)]}_{\text{risk (over states)}} + \underbrace{\mathbb{E}_{Q(X \mid \pi)}\big[H[P(Y \mid X)]\big]}_{\text{ambiguity}}
\end{aligned}
\tag{2.6}

The first of these is perhaps the most useful, intuitively, as it expresses the value of seeking new information (i.e., exploration) in exactly the same units (nats) as the value of seeking preferred observations (i.e., exploitation), dissolving the classic exploit-explore dilemma in behavioral psychology. By minimizing expected free energy, the relative balance between these terms determines whether behavior is predominantly explorative or exploitative. Note that pragmatic value emerges as a prior belief about observations, where the C-parameter includes preferences. The (potentially unintuitive) link between prior beliefs and preferences is unpacked in chapter 7; for now, we note that this term can be treated as an expected utility or value, under the assumption that valuable outcomes are the kinds of outcomes that characterize each agent (e.g., a body temperature of 37°C).

The information gain term inherits from the divergence we considered in section 2.6, which ensures that free energy is an upper bound on surprise. However, there is a twist: instead of minimizing the divergence, we want to select policies that maximize the expected divergence—hence, information gain. This switch is due to the fact that we are now taking an average of the log probabilities over outcomes that have yet to be observed. This is a subtle point that can be understood in terms of outcomes switching their roles. When evaluating the free energy of outcomes, the outcomes are the consequences. However, when evaluating the expected free energy, the outcomes play the role of causes, in the sense that they are variables that are hidden in the future but explain decisions in the present.

The ensuing information gain penalizes observations for which there is a many-to-one mapping from states to observations—in the sense that one can obtain the same observation in different states—as this precludes precise belief updating. In artificial intelligence and robotics, states that bring about the same observation (e.g., two T-junctions of a maze that look identical) are sometimes called aliased and are generally hard to deal with using simple methods (i.e., stimulus-response, with no inference or memory). The problem is that we cannot know which state we occupy from current observations alone. Active Inference avoids getting into such situations in the first place, given their low potential for information gain.

A simple example may help unpack the distinction between information gain (or epistemic value) and pragmatic value and highlight why, in most realistic situations, pragmatic and epistemic values need to be pursued in tandem. Imagine a person who wants an espresso and knows that there are two good cafes in the town: one that opens only from Monday to Friday and another that opens only during the weekend. If he does not know what day of the week it is, he has to first select an action that has epistemic value and resolves his uncertainty (i.e., an epistemic action to look at the calendar)— and only after that select an action that carries pragmatic value and brings the reward (i.e., a pragmatic action to go to the correct cafe). This scenario illustrates the fact that in most uncertain situations, one must first perform epistemic actions to resolve uncertainty before confidently selecting a pragmatic action. Policy selection methods that fail to consider the epistemic affordance of choices can only select policies by using random number generators—and will often miss out on their espresso. Therefore, schemes that consider only pragmatic value are generally restricted to situations with no epistemic uncertainty, such as in the case of a person who already knows the day of the week and hence can head directly to the correct cafe.

The second decomposition in equation 2.6 is in terms of risk and expected ambiguity. These terms are the analogues of complexity and inaccuracy: risk is the expected complexity, and ambiguity is the expected inaccuracy. Risk, a common notion in economics, corresponds to the fact that there can be a one-to-many mapping between policies and their consequences—in the sense that one can obtain several different outcomes (by chance) under the same policy. One example is a gambling scenario with stochastic rewards (e.g., a one-armed bandit, aka a slot machine), wherein one could know the reward distribution—say, that one will obtain a reward 10 percent of the time. This is called a risky situation in economics because, after the same move (pulling a lever), one could obtain two different observations (reward or no reward). This means one has to choose policies or plans that accommodate uncertainty. In risk-sensitive schemes—like Active Inference—the game is to choose policies whose probabilistic outcomes match, in the sense of a KL-Divergence, one's prior preferences. In short, minimizing complexity cost becomes minimizing risk when both are measures of departure from prior beliefs.

Similarly, ambiguity corresponds to the expected inaccuracy due to an ambiguous mapping between states and outcomes. A mapping is ambiguous if the distribution of outcomes anticipated is highly dispersed (or entropic) even if we know the states generating them with complete confidence. For instance, the probability of heads or tails in a coin flip, conditioned on whether it is sunny or raining, will be maximally ambiguous, as there is no relationship between the weather and the 50-50 chance of heads or tails. As such, it would not be possible to gain information about the weather by observing tails. Note that most situations are endowed with both risk and ambiguity—which implies one-to-many mappings between states and outcomes and between policies and outcomes. Recall that outcomes (observations) are the only sort of variable that can be observed. Active Inference deals automatically with these situations, because expected free energy comprises both risk and ambiguity terms.
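A minimal numerical sketch of the risk-plus-ambiguity decomposition (our own illustration; the likelihood, preferences, and state beliefs are hypothetical):

```python
import numpy as np

def kl(p, q):
    return (p * np.log(p / q)).sum()        # KL-Divergence between distributions

def entropy(p):
    return -(p * np.log(p)).sum()           # Shannon entropy (in nats)

A = np.array([[0.5, 0.5],   # P(y | x): an ambiguous state (coin-flip outcomes)
              [0.9, 0.1]])  # an unambiguous state
C = np.array([0.8, 0.2])    # preferred outcome distribution P(y | C)
Qx = np.array([0.4, 0.6])   # predicted states under a policy, Q(x | pi)
Qy = Qx @ A                 # predicted outcomes, Q(y | pi)

risk = kl(Qy, C)                                    # expected complexity
ambiguity = Qx @ np.array([entropy(r) for r in A])  # expected inaccuracy
G = risk + ambiguity                                # expected free energy
print(risk, ambiguity, G)
```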

The third line of equation 2.6 highlights an alternative formulation of the expected free energy by reexpressing risk as a divergence between beliefs about states and preferences defined in terms of states. An appealing feature of this form is that it may be rearranged into an expected energy and entropy in analogy with variational free energy (equation 2.5). While this relationship is attractive, a downside of this formulation is that it assumes the state-space is known a priori such that prior preferences may be associated with states. In most settings, this is not a problem, and the choice between defining preferences in terms of states or outcomes has little practical relevance. However, common practice is to specify preferences in terms of outcomes—allowing the state-space itself to be learned while preserving extrinsic motivation.

In summary, expected free energy can be decomposed in terms of risk and ambiguity and in terms of pragmatic and epistemic values. These decompositions are interesting as they permit a formal understanding of the wide variety of situations that Active Inference deals with. Furthermore, they facilitate an appreciation of how Active Inference subsumes several decision schemes—which may be obtained by ignoring one or more components of expected free energy (figure 2.6). If one removes prior preferences, the pragmatic value becomes irrelevant, and all action is motivated by epistemic affordances—hence such schemes can only handle the resolution of uncertainty. Once prior preferences are removed, the (negative) expected free energy is variously known as expected Bayesian surprise (in the context of attentional exploration) or intrinsic motivation (in the context of autonomous learning). If one removes ambiguity, the resulting scheme corresponds to risk-sensitive or KL control in control theory. Finally, if one removes both ambiguity and prior preferences, the only remaining imperative is to maximize the entropy of observations (or states, if using the formulation in the third line of equation 2.6). This may be interpreted as uncertainty sampling (or keeping one’s options open). Active Inference evinces the formal relations between these schemes and the (limited) situations in which they apply.
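To make figure 2.6 concrete, the following sketch (our own; hypothetical values, continuing the previous example) shows the special cases obtained by removing terms from expected free energy:

```python
import numpy as np

def kl(p, q):
    return (p * np.log(p / q)).sum()

def entropy(p):
    return -(p * np.log(p)).sum()

A = np.array([[0.5, 0.5], [0.9, 0.1]])  # P(y | x), hypothetical
C = np.array([0.8, 0.2])                # prior preferences over outcomes
Qx = np.array([0.4, 0.6])               # predicted states under a policy
Qy = Qx @ A                             # predicted outcomes

risk = kl(Qy, C)
ambiguity = Qx @ np.array([entropy(r) for r in A])

G_full = risk + ambiguity      # full expected free energy
G_kl_control = risk            # ambiguity removed: risk-sensitive (KL) control
# With flat preferences, risk reduces (up to a constant) to the negative outcome
# entropy; the negative of this G is the expected information gain (Bayesian
# surprise / intrinsic motivation):
G_no_preferences = -entropy(Qy) + ambiguity
G_entropy_only = -entropy(Qy)  # both removed: uncertainty sampling
```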

Figure 2.6 Various schemes that can be derived by removing terms from the free energy equation. The upper panel shows the terms contributing to the expected free energy. The lower panels show the schemes that result from removing prior preferences (1), ambiguity (2), or everything except for the prior preferences. Each of these quantities appears in several different fields under a variety of names, but all can be seen as components of the same expected free energy.

Although we have carefully decomposed expected free energy into the ways that different people might read this functional, there is no right or wrong way of carving it up. We will see in the second half of this book why autonomous systems of a certain kind must, in virtue of existing, choose actions that look as if they are minimizing expected free energy. This perspective means there is no privileged role for epistemic (explorative) versus pragmatic (exploitative) imperatives—or for risk versus ambiguity. These (possibly false) dichotomies are just two sides of the same existential coin.

2.9 At the End of the Low Road

Having introduced the two distinct notions of variational free energy and expected free energy, we are now in a position to consider what they achieve together. This represents an endpoint to the low road into Active Inference, starting from the notion of unconscious inference, via the Bayesian brain, the duality of perception and action, and finally planning as inference.

Variational free energy is at the core of Active Inference. It measures the fit between the internal generative model and (current and past) observations. By minimizing variational free energy, creatures maximize their model evidence. This ensures that the generative model becomes a good model of the environment and that the environment complies with the model.

Expected free energy is a way to score alternative policies for planning. This is fundamentally prospective—it considers possible future observations— and counterfactual—the possible future observations are conditioned on the policies one could pursue. Expected free energy measures the plausibility of action policies relative to preferred (future) states and observations. By scoring policies in terms of their negative expected free energy, creatures engaging in Active Inference effectively believe that they pursue the course of action for which this quantity is lowest. In psychological terms, this implies that a creature’s belief about policies directly corresponds to its intention— which it fulfills by acting.

From a conceptual perspective, we can associate minimization of variational free energy and expected free energy with two inferential loops, one nested within the other. Variational free energy minimization is the key (outside) loop of Active Inference, which is sufficient to optimize perception and beliefs about policies. An Active Inference agent can also be endowed with a generative model of the consequences of its action that entails an evaluation of expected free energy (the inside loop). This ability to plan into the future supports prospective forms of action selection by furnishing probability values for policies (Friston, Samothrakis, and Montague 2012; Pezzulo 2012).

2.10 Summary

Active Inference is a theory of how living organisms underwrite their existence by minimizing surprise—or a tractable proxy to surprise, variational free energy—via perception and action. In this chapter, we have sought to motivate this idea starting from a Bayesian treatment of perception as inference and extending this to the domain of action. Bayesian inference rests on a generative model of how sensory observations are generated, which encodes (probabilistically) the organism's implicit knowledge of the world—formalized as prior beliefs and the expected outcomes under alternative states and policies.

The specific take of Active Inference forces us to revisit the usual semantics of a prior in Bayesian inference. Expected states are preferred and include the organism’s conditions for survival (e.g., niche-specific goal states), whereas their opposite—surprising states—are dis-preferred. In this way, by fulfilling their expectations, Active Inference agents ensure their own survival. Given the important links between the notion of priors and the conditions that undergird an organism’s existence, we can also say that in Active Inference, the identity of an agent is isomorphic with its priors. This terminology will become more familiar later in the book.

Note that in this view, surprise (or sometimes surprisal) is a formal construct of information theory and not necessarily equivalent to a (folk) psychological construct. Roughly, the more the organism’s state differs from the prior (which encodes the preferred states), the more it is surprising— hence Active Inference amounts to the idea that an organism (or its brain) has to actively minimize its surprise to stay alive. Under certain conditions, surprise minimization can be construed as the reduction of the discrepancy between the model and the world. More generally, the quantity that is actually minimized in Active Inference is variational free energy. Variational free energy is an (upper-bound) approximation to surprise and can be minimized efficiently using chemical or neuronal message passing and information that is available to the organism’s generative model.

Importantly, both perception and action minimize variational free energy in complementary ways: by refining their (posterior belief) estimate and by performing actions that selectively sample what is expected. Furthermore, Active Inference also minimizes expected free energy by following policies associated with minimal ambiguity and risk. Expected free energy then extends Active Inference to prospective and counterfactual forms of inference. This completes our journey along the low road to Active Inference. In chapter 3, we will travel the high road, which reaches the same conclusion on the basis of first principles and self-organization.

3 The High Road to Active Inference

Survival machines that can simulate the future are one jump ahead of survival machines who can only learn on the basis of overt trial and error. The trouble with overt trial is that it takes time and energy. The trouble with overt error is that it is often fatal. Simulation is both safer and faster.

—Richard Dawkins

3.1 Introduction

In chapter 2, we motivated the introduction of free energy as a means of performing approximate Bayesian inference (i.e., the low road to Active Inference). Here, we introduce free energy from another perspective, that of the high road, which inverts that reasoning: it starts from first principles in statistical physics and the central imperative that organisms must maintain their existence—that is, avoid surprising states—and then introduces the minimization of free energy as a computationally tractable solution to this problem. The chapter discloses the formal equivalence between the minimization of variational free energy and the maximization of model evidence (or self-evidencing) in approximate Bayesian inference, revealing a connection between free energy and Bayesian perspectives on adaptive systems. Finally, it discusses how Active Inference provides a novel first principle perspective to understand (optimal) behavior.

Active Inference is a theory of how living organisms maintain their existence by minimizing surprise—or a tractable proxy to surprise, variational free energy—via perception and action. By starting from first principles, it advances a novel belief-based scheme to understand behavior and cognition, which has numerous empirical implications.

The high road to Active Inference starts from the premise that, to survive, any living organism has to maintain itself in a suitable set of preferred states, while avoiding other, dis-preferred states of the environment. These preferred states are first and foremost defined by niche-specific evolutionary adaptations. However, as we will see later, in advanced organisms these can also extend to learned cognitive goals. For example, to survive, a fish has to stay in a comfort zone that corresponds to a small subset of all the possible states of the universe: it has to stay in water. Similarly, a human has to ensure that their internal states (e.g., physiological variables like body temperature and heart rate) always remain within acceptable ranges— otherwise they will die (or more precisely will become something else, such as a corpse). This acceptable range or comfort zone stipulatively defines the characteristic states something has to be in to be that thing.

Living organisms resolve this fundamental biological problem by exerting active control over their states (e.g., of body temperature) at many levels, which range from automatic regulatory mechanisms such as sweating (physiology) to cognitive mechanisms such as buying and consuming a drink (psychology) to cultural practices such as distributing air conditioning systems (social sciences).

From a more formal perspective, Active Inference casts the biological problem of—or explanation for—survival as surprise minimization. This formulation rests on a technical definition of surprising states from information theory—essentially, surprising states index those outside the comfort zone of living organisms. It then proposes free energy minimization as a practical and biologically grounded way for organisms or adaptive systems to minimize the surprise of sensory encounters.

3.2 Markov Blankets

An important precondition for any adaptive system is that it must enjoy some separation and autonomy from the environment—without which it would simply dissipate, dissolve, and thereby succumb to environmental dynamics. In the absence of this separation, there would be no surprise to minimize; there must be something to be surprised and something to be surprised about. In other words, there are at least two things—system and environment—and these can be disambiguated from one another. A formal way to express a separation between a system and the rest of the environment is the statistical construct of a Markov blanket (Pearl 1988); see box 3.1.

Box 3.1 Markov blankets

A Markov blanket is an important recurring concept in this book (Friston 2019a; Kirchhoff et al. 2018; Palacios et al. 2020). Technically, a blanket (b) is defined as follows:

\[
P(x, y \mid b) = P(x \mid b)\, P(y \mid b) \quad \Longleftrightarrow \quad P(y \mid x, b) = P(y \mid b)
\]

This says (in two different but equivalent ways) that a variable y is conditionally independent of a variable x if b is known. In other words, if we know b, knowing x would give us no additional information about y. A common example of this is a Markov chain, where the past causes the present, which causes the future. In this scenario, the past may only influence the future via the present. This means no additional information about the future is gained by finding out about the past (assuming we know the present).

To identify a Markov blanket in a system wherein we know the conditional dependencies, we can follow a simple rule. The blanket for a given variable comprises its parents (the variables it depends on), its children (the variables that depend on it) and, in some settings, the other parents of its children.
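As an illustration of this rule, a short sketch (the graph and variable names are our own, loosely echoing figure 3.1) that extracts the Markov blanket of a node from a directed graph specified as parent sets:

```python
def markov_blanket(node, parents):
    """parents maps each node to the set of its parents in a directed graph."""
    children = {c for c, ps in parents.items() if node in ps}
    coparents = {p for c in children for p in parents[c]} - {node}
    return parents[node] | children | coparents

# Hypothetical dependency structure, echoing the action-perception cycle:
# external -> sensory -> internal -> active -> external
parents = {
    "external": {"active"},
    "sensory":  {"external"},
    "internal": {"sensory"},
    "active":   {"internal"},
}
print(markov_blanket("internal", parents))  # expect {'sensory', 'active'}
```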

In brief, a Markov blanket is the set of variables that mediate all (statistical) interactions between a system and its environment. Figure 3.1 illustrates an interpretation of a Markov blanket in a dynamic setting. Here the conditional independences have been supplemented with dynamical constraints, so that the flows do not depend upon states on the opposite side of the blanket.

The Markov blanket in figure 3.1 distinguishes states internal to the adaptive system (i.e., brain activity) from external states of the environment. Furthermore, it identifies two additional states, labeled sensory states and active states, which form the blanket that (statistically) separates internal and external states. Statistical separation means that if we knew about the active and sensory states, the external states would offer no additional information about internal states (and vice versa). In a dynamical setting, this is often interpreted as saying internal states cannot directly change external states but can do so vicariously by changing active states. Similarly, external states cannot directly change internal states but can do so indirectly by changing sensory states.

This is a restatement of the classical action-perception cycle, wherein an adaptive system and its environment can interact (only) through actions and observations, respectively. This reformulation has two main benefits.

Figure 3.1 A dynamic Markov blanket, which separates an adaptive system (here, the brain) from the environment. The dynamics of each set of states are determined by a deterministic flow specified as a function (f) giving the average rate of change and additional stochastic (random) fluctuations (ω). The arrows indicate the direction of influence of each variable over the rates of change of other variables (technically, the nonzero elements of the associated Jacobians). This is just one example; one can use a Markov blanket to separate an entire organism from the environment or nest multiple Markov blankets within one another. For example, brains, organisms, dyads, and communities can be conceived in terms of different Markov blankets that are nested within one another (see Friston 2019a; Parr, Da Costa, and Friston 2020 for a formal treatment). Confusingly, different fields use different notations for the variables; sometimes, sensory states are denoted s, external states η, and active states a. Here we have chosen variables for consistency with the other chapters in this book.

First, it formalizes the fact that an adaptive system’s internal states are autonomous from environmental dynamics and can therefore resist their influences. Second, it scaffolds the way in which adaptive systems minimize their surprise: it highlights the internal, sensory, and active states they have access to. Specifically, surprise is defined in relation to sensory states, while internal and active state dynamics are the means by which the surprise of sensory states may be minimized.

The key point to notice here is that the internal states of an adaptive system bear a formal relation to external states. This is due to a kind of symmetry across the Markov blanket as both influence and are influenced by blanket states. A consequence of this is that we can construct conditional probability distributions for the internal and external states, given the blanket states. Because these are conditioned on the same blanket states, we can associate pairs of expected internal and external states with one another. In other words, on average, the internal and external states acquire a kind of (generalized) synchrony—just as we might anticipate on attaching a pendulum to each end of a wooden beam. Over time, as they synchronize, each pendulum becomes predictive of the other through the vicarious influence of the beam (Huygens 1673). Figure 3.2 offers a graphical intuition for this relationship. This means that if we can write down independent

Figure 3.2 Association between average internal states of a Markov blanket and distributions of external states. Top: Assuming a linear Gaussian form for the conditional probabilities, these plots show samples from the conditional distribution over external and internal states, respectively, given blanket states. The thick black lines indicate the average of these variables given the associated blanket state. Bottom left: The same data are plotted to illustrate the synchronization of internal and external states afforded by sharing a Markov blanket—here, an inverse synchronization. The dashed lines and black cross illustrate that if we knew the average internal state (vertical line), we could identify the average external state (horizontal line) and the spread around this point. Bottom right: We can associate the average internal state with a distribution over the external state.

distributions over external and internal states given their Markov blanket, the two states become informative about one another via this blanket.

This synchrony gives internal states the appearance of representing (or modeling) external states—which links back to the idea of surprise minimization introduced in chapter 2. This is because surprise depends on an internal model of how sensory data are generated. To recap, minimizing the surprise (negative log probability) of sensory observations becomes identical to maximizing the evidence (marginal likelihood) for the model, which is just the probability of sensory observations under that model. This notion of surprise minimization can be understood from two equivalent— Bayesian and free energy—perspectives, which we discuss next.

3.3 Surprise Minimization and Self-Evidencing

Under a Bayesian perspective, an agent with a Markov blanket appears to model the external environment in the sense that internal states correspond (on average) to a probabilistic representation—an approximate posterior belief—of external states of the system (figure 3.2). The dynamics of internal states correspond to a form of (approximate) Bayesian inference of external states, as their motion changes the associated probability distribution, which is afforded by an implicit generative model of how sensations (or sensory states in the Markov blanket jargon) are generated. If we reinstate the notion of an agent as constituted by internal and blanket states, we can talk about an agent’s generative model.

Importantly, the agent’s generative model cannot simply mimic external dynamics (otherwise the agent would simply follow external dissipative dynamics). Rather, the model must also specify the preferred conditions for the agent’s existence, or the regions of states that the agent has to visit to maintain its existence, or satisfy the criteria for its existence in terms of occupying characteristic states. These preferred states (or observations) can be specified as the priors of the model—which implies that the model implicitly assumes that its preferred (prior) sensations are more likely to occur (i.e., are less surprising) if it satisfies the criteria for existence. This means it has an implicit optimism bias. This optimism bias is necessary for the agent to go beyond the mere duplication of external dynamics to prescribe active states that underwrite its preferred or characteristic states.

Under this formulation, one can cast optimal behavior (with respect to prior preferences) as the maximization of model evidence by perception and action. Indeed, model evidence summarizes how well the generative model fits or explains sensations. A good fit indicates that the model successfully accounts for its sensations (this is the descriptive side of inference); at the same time, it realizes its preferred sensations, given that they are less surprising (this is the prescriptive side of the inference). Such a good fit is a guarantee of surprise minimization, as maximizing model evidence P(y) is mathematically equivalent to minimizing surprise, ℑ(y) = −ln P(y).

A way to reformulate the above arguments more succinctly consists in saying that any adaptive system engages in "self-evidencing" (Hohwy 2016). Self-evidencing here means acting to garner sensory data consistent with (i.e., that provide evidence for) an internal model, hence maximizing model evidence.

3.3.1 Surprise Minimization as a Hamiltonian Principle of Least Action

In the preceding sections, we have asserted that surprise must be minimized but have not detailed why this is. Although the details of the underlying physics of self-evidencing are outside the scope of this book (see Friston 2019b for details), we here provide a brief overview of the principles. These are underwritten by the idea that biological creatures—with Markov blankets—persist over time, resisting the dispersive effects of environmental fluctuations. The persistence of a Markov blanket implies that the distribution of blanket states remains constant over time. Simply put, this means that any deviation of sensory (or active) states from regions that are highly probable under this distribution must be corrected by the average flow of states (which is just the deterministic part of the flow in figure 3.1). Expressing this as a physicist might, stochastic (random) systems at steady state engage in dynamics that (on average) descend an energy function (or Hamiltonian) that is interpretable as a negative log evidence or surprise. This is like a ball rolling down a hill from high gravitational potential energy at the top of the hill to low energy in a basin. See figure 3.3.

For the system shown on the left of figure 3.3, every time a fluctuation causes a move to a less probable state, this is corrected by a move up the probability gradient, such that the system occupies probability-dense regions a greater proportion of the time. The key insight here is that this system maintains sensory states within a narrow range by minimizing surprise (on average)—in contrast to the system on the right, for which surprise grows indefinitely.
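A minimal simulation sketch in the spirit of figure 3.3 (our own code; the quadratic surprise landscape and all parameters are hypothetical) contrasts a system whose average flow descends a surprise gradient with one that merely diffuses:

```python
import numpy as np

rng = np.random.default_rng(0)

def surprise_grad(x):
    # Gradient of -ln p(x) for a standard Gaussian steady-state density
    return x

dt, steps = 0.01, 5000
x_steady = np.array([5.0, 5.0])    # system with a corrective average flow
x_diffuse = np.array([5.0, 5.0])   # system whose dynamics ignore surprise

for _ in range(steps):
    noise = rng.normal(size=2) * np.sqrt(dt)
    x_steady = x_steady - surprise_grad(x_steady) * dt + noise
    x_diffuse = x_diffuse + noise

print(np.linalg.norm(x_steady))    # stays near the low-surprise region
print(np.linalg.norm(x_diffuse))   # wanders, dissipating over time
```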

Surprise minimization permits living organisms to (temporarily) resist the second law of thermodynamics, which states that entropy—or the

Figure 3.3 Left: Path taken by a 2-dimensional random dynamical system with a (nonequilibrium!) steady state. This can be interpreted as minimizing its surprise, which is shown in the contour plot on the right. Right: The center is the least surprising region; the circles moving away from the center represent progressively more surprising regions. Middle: In contrast, this plot shows the trajectory of a system starting in the same place (5,5), with random fluctuations of the same amplitude, whose dynamics bear no relation to surprise. Not only does it enter more surprising regions of space; it also fails to achieve any sort of steady state, dissipating in an unconstrained fashion over time. The scope of Active Inference is restricted to systems like that on the left—which counter random fluctuations with their average flow and thereby retain their form over time.

dispersion of systemic states—always grows. This is because entropy is the long-term average of surprise and, on average, the maximization of the log probability of observations is equivalent to minimization of (Shannon) entropy:

\[
H[P(y)] = \mathbb{E}_{P(y)}[-\ln P(y)] = \mathbb{E}_{P(y)}[\Im(y)]
\]

Ensuring that a small proportion of sensory states is occupied with high probability is equivalent to maintaining a particular entropy. This is a defining characteristic of self-organizing systems, as long recognized by cybernetic theories.

From a physiologist’s perspective, surprise minimization formalizes the idea of homeostasis. As a sensor value leaves its optimal range, negative feedback mechanisms kick in that reverse these deviations. From a control perspective, we can interpret optimal behavior in relation to some desired steady state probability density. In other words, if we define a distribution of preferred outcomes, optimal behavior will involve evolution of the system toward—and maintenance of—that distribution.

As we saw in chapter 2, free energy is an upper bound on surprise, suggesting that optimal behavior can be obtained by minimizing free energy in the face of random fluctuations. Recall that the difference between free energy and surprise is the divergence between an exact posterior probability (i.e., the distribution of external states given blanket states) and an approximate posterior probability (i.e., the distribution over external states given average internal states). As such, the motion of internal states can be thought of as minimizing the divergence, which then enables active states, on average, to minimize the surprise accompanying sensory states. In other words, the optimal behavior resulting from free energy minimization is the one that is least surprising and follows a path of least Action from the current state to the desired state—that is, the Hamiltonian principle of least Action applied to behavior.

Figure 3.3 shows a very simple example of a system equipped with a random attractor. This is analogous to a thermostat, which (in cybernetic parlance) has a single set-point and cannot learn or plan. Active Inference aims to use the same explanatory apparatus to cover much more complex and adaptive systems. Here, the difference between the simplest and more complex systems can be reduced to the different shapes of their attractors—from fixed points to increasingly more complex and itinerant dynamics. From this perspective, one can understand living organisms as constantly seeking a compromise between excessive stability and excessive dispersion—and Active Inference aims to explain how such a compromise is achieved.

3.4 Relations between Inference, Cognition, and Stochastic Dynamics

The physicist E. T. Jaynes famously argued that inference, information theory, and statistical physics are different perspectives on the same thing (Jaynes 1957). In the previous sections, we discussed how Bayesian and statistical physics perspectives offer two equivalent ways to understand surprise minimization and optimal behavior—effectively adding a form of cognition to Jaynes’s triad. This equivalence between various schools of thought is appealing but can be confusing to those who are not familiar with the respective formalisms, where many different words are used to refer to the same quantities. To help demystify this, in this section we elaborate on the main equivalences between Bayesian and statistical physics perspectives and their cognitive interpretations; see table 3.1 for a summary and box 3.2.

Table 3.1 Statistical physics, Bayesian inference, and information theory—and their cognitive interpretations

Box 3.2 Free energy in statistical physics and Active Inference

The notion of free energy is widely used in statistical physics to characterize (for example) thermodynamic systems. Although Active Inference uses exactly the same equations, it applies them to characterize the belief state of an agent (in relation to a generative model). Hence, when we talk of an Active Inference agent minimizing its (variational) free energy, we are referring to processes that change its belief state, not (for example) the particles of its body. To avoid misunderstandings, we use the term variational free energy, hence adopting a terminology that is more common in machine learning. Another more subtle point is that the concept of free energy is often used in the context of equilibrium statistical thermodynamics. Active Inference targets living organisms—or nonequilibrium steady-state systems that are open—that feature continuous, reciprocal exchanges with the environment. This is an exciting novel field (Friston 2019a).

3.4.1 Variational Free Energy, Model Evidence, and Surprise

A first important equivalence is between the maximization of model evidence (or marginal likelihood) in Bayesian inference and the minimization of variational free energy—both of which minimize surprise. This equivalence becomes evident when one appeals to a specific approximate solution to intractable problems of inference—variational inference. Variational inference recasts the inference problem as an optimization problem by minimizing free energy. The minimum of the free energy is the point at which the approximation of the exact solution is at its best. Expressing this formally sheds light on the relations between the three quantities:

\[
F[Q, y] = \underbrace{D_{KL}[Q(x) \parallel P(x \mid y, m)]}_{\text{divergence} \,\geq\, 0} \;-\; \underbrace{\ln P(y \mid m)}_{\text{log evidence}} \;\geq\; -\ln P(y \mid m) = \Im(y)
\tag{3.2}
\]

In equation 3.2, unlike in chapter 2, we have explicitly conditioned all quantities on a model, m, to emphasize that these depend on the model we have (or are) of how y is generated, and the quantities will vary if different models are used. The equivalence of these quantities raises the question as to why it is useful to distinguish between them. The main reason is that, unlike model evidence, variational free energy can be minimized efficiently.

Recall from chapter 2 that the variational free energy is only exactly equivalent to the negative model evidence or surprise when the KL-Divergence term becomes zero. This is not always possible, but it can often be made close to zero. Hence, in the process of finding better and better values for Q(x), variational free energy also approximates surprise more closely. We have said this a few times already because it is important to emphasize the central relationship between free energy and surprise that is the foundation of this book. Specifically, free energy is an upper bound on surprise. It can be the same as or greater than surprise—and the extent to which it is greater is quantified by the KL-Divergence.

An interesting aspect of this is that any system minimizing its surprise, including the very simple system in figure 3.3, is also minimizing a free energy, where the Q(x) is always set to be equal to the exact posterior probability—that is, setting the KL-Divergence to zero. One perspective on the difference between cognitive and noncognitive systems is that the latter always have a zero KL-Divergence, while cognitive systems must go through the (perceptual) process of minimizing this term before their actions are guaranteed to minimize surprise. Note that minimizing the divergence is the only thing that perception can do. This places a great deal of emphasis on the motion of internal states, such that the distribution they parameterize (figure 3.2) is as close to the exact posterior as possible. However, perception cannot minimize the second (evidence) component of variational free energy that corresponds to the actual surprise, because it cannot change the sensations that have been gathered. Only by acting in ways that change sensations can an agent minimize the second (evidence) component of variational free energy and resolve its surprise—or, equivalently, maximize its model evidence. This places emphasis on the motion of active states, given internal states, in self-evidencing.

An example helps in illustrating this point. Imagine that your generative model predicts a distribution of glucose levels in your blood given levels of hunger, with relatively high versus low glucose levels relating to satiation and hunger, respectively. In addition, imagine this model ascribes a higher prior probability to satiation and therefore to relatively high glucose levels— making low glucose levels surprising. Imagine you are initially uncertain about your hunger levels and sense low blood glucose. Perception leads to the inference that you are hungry and the experience of hunger—closing the KL-Divergence. However, perception cannot go further than that to reduce your surprise—and the discrepancy between the high level of glucose that you expect a priori and the low level of glucose that you sense—because it cannot act on your sensations (low glucose) or their causes (physiology). You can only minimize your surprise by acting to change (the hidden source of) the sensations you gather—for example, by eating a dessert.

In sum, perception can minimize variational free energy by reducing the discrepancy between approximate and true posterior but cannot go further in minimizing surprise. The next step of surprise minimization entails changing the sensations one gathers by acting, which is where inference goes beyond perception and becomes active.
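The same toy model illustrates this division of labor. In the sketch below (all numbers invented), perception has already closed the divergence; the only remaining way to lower free energy is to change which outcome is sampled, which is precisely what action does:

import numpy as np

prior = np.array([0.7, 0.3])                # P(x): state 0 is expected a priori
likelihood = {1: np.array([0.2, 0.9]),      # P(y = 1 | x)
              0: np.array([0.8, 0.1])}      # P(y = 0 | x)

def surprise(y):
    # -ln P(y): the component of free energy that perception cannot reduce
    return -np.log(np.sum(likelihood[y] * prior))

print(surprise(1))   # ~0.89 nats: the currently sensed outcome
print(surprise(0))   # ~0.53 nats: the outcome an action could sample instead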

3.4.2 Expected Free Energy and Inference of the Most Likely Trajectory

Another important equivalence is between the minimization of expected free energy and inferring the most likely course of action, or policy. This goes beyond specifying the least surprising part of state-space and deals with how surprising alternative routes to that part or location may be. These alternative paths are expressed in terms of policies, which are essentially trajectories across states. Importantly, in Active Inference the log probability of a policy is set to be proportional to the negative of the expected free energy that would be incurred were that policy pursued. This implies that the most probable or least surprising path is (set to be) the one that minimizes expected free energy. This formulation is equivalent to the way Action is defined in physics, where it scores the probability of a path by an integral (or sum) of an energy. While a physical system may pursue a space of hypothetical trajectories, the path it actually follows is the one for which Action is minimized—that is, Hamilton’s principle of least Action. This analogy between Active Inference and Hamilton’s principle of least Action is unpacked in the next section.
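As a toy rendering of this correspondence (with invented expected free energies), setting the log probability of each policy proportional to its negative expected free energy makes the lowest-G policy the most probable path:

import numpy as np

G = np.array([4.1, 2.3, 3.0])          # expected free energy of three policies
P = np.exp(-G) / np.exp(-G).sum()      # softmax of negative G
print(P)                                # ~[0.10, 0.60, 0.30]: lowest G dominates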

3.5 Active Inference: A Novel Foundation to Understand Behavior and Cognition

In fields like optimal control, reinforcement learning, and economics, the optimization of behavior results from a value function of states, following Bellman’s equation (Sutton and Barto 1998). Essentially, each state (or state-action pair) is assigned a value, which represents how good a state is for an agent to be in. The value of states (or state-action pairs) is usually learned by trial and error, by counting how many times—and after how much time—one obtains reward by starting from those states. Behavior consists in optimizing reward acquisition by reaching high-valued states, hence capitalizing on learning history.

In contrast, in Active Inference, behavior is the result of inference and its optimization is a function of beliefs. This formulation unites notions of (prior) belief and preference. As discussed above, using the notion of expected free energy amounts to endowing the agent with an implicit prior belief that it will realize its preferences. Hence, the agent’s preference for a course of action becomes simply a belief about what it expects to do, and to encounter, in the future—or a belief about future trajectories of states that it will visit. This replaces the notion of value with the notion of (prior) belief. This is an apparently strange move, if one has a background in reinforcement learning (where value and belief are separated) or Bayesian statistics (where belief does not entail any value). However, it is a powerful move, for at least three reasons.

First, it automatically entails a self-consistent process model of purposive (or teleological) behavior, which is akin to cybernetic formulations. If we endow an Active Inference agent with some prior preference, then it will act to realize such preferences—because this is the only course of action consistent with its prior belief that it will act to fulfill its expectations. Note that the resulting (preferred) course of action, or policy, is directly measurable in experimental settings, whereas a value function or prior belief needs to be inferred and hence is a more indirect, if not tautological, measure.

Second, casting behavior as a functional of beliefs (probability distributions) automatically entails notions such as degree of belief and uncertainty. These notions undergird important facets of adaptive action but are not directly available in the Bellman formulation. By the same token, this formulation gives more flexibility in modeling sequential dynamics and itinerant behaviors, which are harder to model in terms of a value function of states (Friston, Daunizeau, and Kiebel 2009).

Third, in this formulation, optimal behavior comes to follow Hamilton’s principle of least Action from statistical physics. Indeed, Active Inference goes one step further than the idea that behavior is a function of beliefs: it also assumes that this function is an energy function, such that the most likely course of action of an Active Inference agent is the one that minimizes free energy. A profound consequence is that living organisms behave according to Hamilton’s principle of least Action: they follow a path of least resistance until they reach a steady state (or a trajectory of states), as exemplified by the behavior of a random dynamical system (shown in figure 3.3). This is a fundamental assumption that distinguishes Active Inference from alternative theories of behavior and cognition based on the Bellman formulation.

It is worth briefly outlining what we mean by drawing analogies between Hamiltonian physics and Active Inference. This is intended on three levels. The first is that the advance offered by Active Inference to the behavioral and life sciences is comparable to the advance Lagrangian and Hamiltonian formulations offered to Newton’s account of mechanics. While Newtonian mechanics was originally formulated in terms of differential equations—including Newton’s famous second law expressing the proportionality between force and acceleration—a complementary perspective on mechanics was offered by considering what is conserved by dynamical systems. Newtonian dynamics can then be derived from these conservation laws. These offer a perspective on which to base further theoretical advances, and they form the basis for parts of stochastic, relativistic, and quantum physics. Analogously, Active Inference reformulates the sorts of neuronal and behavioral dynamics that might previously have been built up from a series of differential equations by specifying the quantity—free energy—from which these dynamics may be derived. Just as different sorts of Hamiltonians lead to different types of physics, free energies based on different generative models lead to different neuronal and behavioral dynamics.

The second point of connection between Hamiltonian physics and Active Inference arises from a more direct association between a Hamiltonian and probability measures. The idea here is to associate the conserved Hamiltonian with the energy of the system. Remember that the quantities we have referred to as energies so far (here and in chapter 2) have all had the form of a negative log probability. This reflects an interpretation of energy as simply a measure of the improbability of any given configuration of a system. On this view, conservation of energy and of probability are equivalent laws. As dissipative systems—coupled to external states via a Markov blanket—move to states of low energy or high probability, we can directly associate the energy or Hamiltonian with surprise. As such, Active Inference is Hamiltonian physics applied to a certain kind of system (systems that feature a Markov blanket).

The third association between these formulations is the variational calculus that underwrites the association between energies and dynamics. This is most apparent when Hamiltonian physics is expressed as a principle of least Action, where Action refers to the integral of a Lagrangian over a path. Crucially, this Action is a functional of a path. Here, a path is a function of time whose output is the position and velocity of a particle on that path at that time. The path followed by a (deterministic) particle minimizes this Action. Similarly, Active Inference is predicated on the idea that beliefs (themselves functions of hidden states) must minimize a free energy functional. The key point of contact here is that in both cases, functions (paths or beliefs) must be optimized in relation to functionals (Action or free energy, respectively). This places both in the context of variational calculus, which is a branch of mathematics dedicated to finding extrema of functionals. In physics, this leads to the Euler-Lagrange equations. In Active Inference, we arrive at variational inference procedures.
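Schematically, the parallel can be put side by side; the notation below is a standard rendering of the two variational problems rather than a quotation from this chapter. In physics, a path x(t) renders the Action functional stationary, yielding the Euler-Lagrange equations; in Active Inference, a belief Q minimizes the free energy functional:

\[
\delta S[x] = 0, \qquad S[x] = \int L(x, \dot{x}, t)\, dt \;\;\Rightarrow\;\; \frac{\partial L}{\partial x} - \frac{d}{dt}\frac{\partial L}{\partial \dot{x}} = 0
\]

\[
Q^{*} = \arg\min_{Q} F[Q, y], \qquad F[Q, y] = \mathbb{E}_{Q(x)}[\ln Q(x) - \ln P(y, x)]
\]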

3.6 Models, Policies, and Trajectories

In section 3.2, we highlighted that the scope of Active Inference pertains to those systems that enjoy some separation from their environment, and we saw that this translates into the presence of a Markov blanket. In section 3.3, we highlighted that the persistence of this blanket requires dynamics that (on average) minimize the surprise of (sensory) states. As this may be interpreted as self-evidencing, we arrive at the conclusion that behavior is determined by a steady-state distribution that can be interpreted as a generative model of how (sensory) data are generated.

This tells us something very important. Different generative models should be associated with different sorts of behavior. As such, different sorts of behavior may be accounted for by specifying different generative models—and implicitly what that system would find surprising. Furthermore, different kinds of generative model may correspond to adaptive or cognitive creatures having various levels of complexity (Corcoran et al. 2020). Very simple generative models of the sort driving the dynamics in figure 3.3 offer a minimal sort of cognition, as they cannot entertain the possibility of alternative (or counterfactual) trajectories. Further, these models are shallow, in the sense that they afford inference at just one timescale. In contrast, hierarchical generative models afford inference at multiple timescales. In hierarchical or deep models, the dynamics at higher hierarchical levels generally encode things that change more slowly (e.g., the sentence I am reading) and that contextualize things that change faster (e.g., the word I am reading), which are represented at lower hierarchical levels (Kiebel et al. 2008; Friston, Parr, and de Vries 2017).

What do we need to include in a model to derive more complex behaviors of the sort we would associate with agency and sentient systems? One answer to this is the capacity to model alternative futures, or different ways in which events might play out—and to select among them. In turn, considering possible futures requires a generative model that has some temporal depth and explicitly represents the consequences of actions. Working this into the model will ensure behavior that conforms to the most likely of these futures. The (counterfactual) capacity to entertain these alternatives may be what separates the steady state associated with sentient systems from simpler creatures. When alternative futures pertain to things over which we have control, we refer to these as policies or plans. As we saw in chapter 2, one way of disambiguating between these plans is to incorporate a prior belief into a model that says that those policies with the lowest expected free energy are the most plausible. This offers a way of characterizing a certain kind of system with a Markov blanket at steady state—which seems to correspond well to systems like us.

3.7 Reconciliation of Enactive, Cybernetic, and Predictive Theories under Active Inference

By emphasizing free energy minimization, Active Inference unites and extends three apparently disconnected theoretical perspectives.

First, Active Inference is in keeping with enactive theories of life and cognition, which emphasize the self-organization of behavior and autopoietic interactions with the environment, which ensure that living organisms remain within acceptable bounds (Maturana and Varela 1980). Active Inference provides a formal framework explaining how living organisms manage to resist the dispersion of their states by self-organizing a statistical structure— the Markov blanket—that affords reciprocal exchanges between organism and environment while also separating (and in a sense protecting the integrity of) the organisms’ states from external, environmental dynamics.

Second, Active Inference is in keeping with cybernetic theories, which describe behavior as purposive and teleological. Teleology means that behavior is internally regulated by a mechanism that continuously tests whether a goal is achieved and, if not, steers corrective actions (Rosenblueth et al. 1943, Wiener 1948, Ashby 1952, G. Miller et al. 1960, Powers 1973). Similarly, Active Inference agents use both perception and action to minimize the discrepancy between preferred and sensed states. Active Inference provides a normative and viable description of the minimization process by specifying that what is actually minimized is a statistical quantity that the agent can measure—variational free energy—which under certain conditions corresponds to a prediction error, or the difference between what is expected and what is sensed. This implies a formulation of cybernetic control as a prospective process—which leads us to the next point.

Third, Active Inference is in keeping with theories that describe control as a prospective process that rests on a model of the environment—possibly physically implemented in the brain (Craik 1943). Active Inference assumes that agents use a (generative) model to construct predictions that guide perception and action and to evaluate their future (and counterfactual) action possibilities. This assumption is coherent with the good regulator theorem (Conant and Ashby 1970), which says that any controller should have—or be—a good model of the environment. Active Inference reconciles these model-based perspectives on brain and behavior under a rigorous characterization in terms of (approximate) Bayesian inference and (variational and expected) free energy minimization. Furthermore, Active Inference is largely coherent with ideomotor theory (Herbart 1825, James 1890, Hoffmann 1993, Hommel et al. 2001), which states that action starts with an imaginative process, and it is a predictive representation (of action consequences) that triggers actions—not a stimulus, like in stimulus-response theory (Skinner 1938). Active Inference casts this idea in an inferential framework, in which an action stems from a belief (about the future); this has a number of implications, such as the fact that in order to trigger action, one has to temporarily attenuate sensory evidence (which would otherwise falsify the belief that triggers action) (H. Brown et al. 2013).

The reconciliation of these frameworks is interesting, as they are often considered at odds. For example, self-organization and teleology are often seen as incompatible in biology. Furthermore, enactive theories tend to de-emphasize representation and control, which is instead a central construct of most theories of model-based inference. Active Inference formalizes autopoietic dynamics of adaptive agents from an unusual angle, which simultaneously considers self-organization and prediction. By connecting different perspectives, Active Inference can potentially help us understand how they illuminate one another.

3.8 Active Inference, from the Emergence of Life to Agency

Active Inference starts from first principles and unfolds them to explain behavior and cognition expressed by the simplest to the most complex forms of adaptive and living systems. In the continuum between simpler and more complex creatures, Active Inference draws a line between those that minimize variational free energy and those that also minimize expected free energy.

Any adaptive system that actively samples sensations to minimize variational free energy is (equivalently) an agent that actively gathers evidence for its generative model, aka a self-evidencing agent (Hohwy 2016). These systems are able to avoid dissipation, self-regulate, and survive by achieving set-points provided by basic homeostatic processes. These systems can generate complex and diverse forms of behavior and can also have very high fitness levels (as is already apparent in the case of viruses). Some may have hierarchical generative models that permit inferring events that change at different timescales, from faster (at lower hierarchical levels) to slower (at higher levels)—and hence can develop sophisticated strategies to deal with what they experience. However, these creatures are also fundamentally limited because their generative models lack temporal depth—or the capacity to plan and consider the future explicitly (although they can do so implicitly, for example, as a result of genetic evolution)—and hence they always live in the present.

A generative model endowed with temporal depth opens the door to the minimization of expected free energy—or in psychological terms, planning. In Active Inference, this entails much more than increased adaptivity: it entails at least a primitive form of agency. For an adaptive system, minimizing expected free energy is equivalent to having the (implicit) prior that one is a free energy minimizing agent—but one that acts to minimize free energy in the future. When this (prior) belief enters the generative model, the adaptive system becomes able to form beliefs about how it should behave in the future and about which trajectories it will pursue. In other words, it becomes able to select among alternative futures, as opposed to simply selecting how to deal with the sensed present, as in the simplest agents described above. This temporal depth therefore translates into a psychological depth. How living creatures populate the continuum between the simplest and most complex adaptive systems—and what forms of Active Inference they can express—is an empirical question.

3.9 Summary

The main topics of this chapter can be summarized as follows: Living organisms have to ensure that they only visit their characteristic or preferred states. If one defines these preferred states as expected states, then one can say that living organisms must minimize the surprise of their sensory observations (and maintain an optimal entropy; see box 3.3).

Doing this requires agents to exercise some autonomy from environmental dynamics and to be equipped with a Markov blanket that separates (i.e., expresses a conditional independence between) their internal states and the external states of the environment. Agents within the Markov blanket can engage in reciprocal (action-perception) exchanges with the environment. These exchanges are formally described by the theory of Active Inference, where both perception and action minimize surprise. They can do so by being equipped with a probabilistic generative model of how their sensory observations are generated. This model defines surprise—or better, a tractable proxy, variational free energy, which can be measured and minimized efficiently.

An Active Inference agent appears to perform (approximate) Bayesian inference under a generative model and to maximize evidence for its model—that is, it is a self-evidencing agent. The prospective part of the inference is realized by selecting courses of action, or policies, that are expected to minimize free energy in the future. This formalism leads to a novel view of (optimal) behavior in terms of Hamilton’s principle of least Action—a (first) principle that connects Active Inference to the domains of statistical physics, thermodynamics, and nonequilibrium steady states.

Box 3.3 Entropy minimization and open-ended behavior

Active Inference is based on the premise that living organisms strive to maintain a relative order (or negative entropy), controllability, and predictability, despite being immersed in an environment whose natural forces generate continuous fluctuations—and a never-ending threat of entropic erosion. The most basic manifestation of this active pursuit of order is physiological homeostasis, with critical physiological parameters that need to be kept within viable regions. However, minimizing entropy should not be equated with a rigid repertoire of responses (e.g., autonomic homeostatic responses); rather the opposite is true, especially in advanced organisms. We can develop open-ended repertoires of novel behaviors to pursue our original homeostatic imperatives—for example, to produce and buy good wine to satisfy thirst and other needs. This is sometimes referred to as “allostasis” (Sterling 2012).

More broadly, we actively pursue some order and controllability per se, without necessary reference to a specific homeostatic imperative—perhaps because preserving order facilitates many such imperatives. We actively carve our ecological niches to render them more predictable and less surprising. This is evident in the ways we construct our physical spaces (e.g., refuges and cities that give shelter from uncontrolled natural forces) and cultural spaces (e.g., societies with laws and deontic norms that give shelter from anarchic social forces). In all these examples, we usually need to accept some short-term increase of entropy or surprise (e.g., when we build something new or shift social stances) to ensure their long-term decrease. This helps us understand how the basic requirement for surprise minimization is not at odds with but rather promotes the epistemic imperatives and novelty-seeking, curious, and exploratory behavior that we recognize as central to many species.

A first way epistemic imperatives become apparent is during the minimization of variational free energy. One of the ways to decompose free energy is to express it as a Gibbs energy expected under the approximate posterior minus the entropy of the approximate posterior. Because free energy is minimized, the agent is in effect striving to increase this entropy. While this seems paradoxical, the paradox disappears if one considers that this is the entropy of the agent’s (approximate posterior) belief, not of its environment. This can be understood as the imperative to explain things as accurately as possible but also to “keep options open” and avoid committing to any specific explanation unless this is necessary—that is, the maximum entropy principle (Jaynes 1957).
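In symbols, the decomposition at issue is the energy-minus-entropy form of the free energy:

\[
F[Q, y] = \underbrace{\mathbb{E}_{Q(x)}[-\ln P(y, x)]}_{\text{expected (Gibbs) energy}} \;-\; \underbrace{H[Q(x)]}_{\text{entropy of beliefs}}
\]

Minimizing F therefore pushes the expected energy down while pushing the entropy of the (approximate posterior) belief up.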

A second way epistemic dynamics become apparent is during the minimization of expected free energy, wherein—interestingly—there are two entropies with opposite signs. These are the posterior predictive entropy (how uncertain I am about what outcomes I would encounter given a choice), which must be maximized—as for beliefs about states in the variational free energy—and the conditional entropy of outcomes given states (the ambiguity entailed by a policy), which must be minimized. While during the minimization of variational free energy the imperative is to maximize the entropy of (present) beliefs, during the minimization of expected free energy the imperative is to select actions that minimize the ambiguity of (future) beliefs. This gives rise to epistemic, curious, novelty-seeking, and information-foraging behaviors, which support uncertainty resolution or improvement of the generative model—which in turn minimizes surprise in the long run (Seth 2013; Friston, Rigoli et al. 2015; Seth and Friston 2016; Schwartenbeck, Passecker et al. 2019).

4 The Generative Models of Active Inference

Everything should be made as simple as possible, but not simpler. —Albert Einstein

4.1 Introduction

This chapter complements the preceding chapters’ conceptual treatment of Active Inference with a more formal treatment. Specifically, it sets out the relationship between free energy and Bayesian inference, the form of the generative models typically used in Active Inference, and the dynamics obtained from minimizing free energy for these models. A key focus is on how time is represented in a generative model. We will see the distinction between generative models formulated in continuous time and those that treat time as a sequence of events. Finally, we set out the idea of inferential message passing, which underwrites prominent theories in neurobiology— including predictive coding.

4.2 From Bayesian Inference to Free Energy

In the preceding two chapters, we outlined some of the important connections between Active Inference and other established paradigms in the neurosciences. In chapter 2, we focused on the notion of the Bayesian brain (Knill and Pouget 2004, Doya 2007)—one of its closest relatives—which provides a useful way to think about some of the consequences of Active Inference from a more formal perspective. Specifically, it helps us frame the problems that an agent engaging in Active Inference must solve. Broadly, these are the problem of inferring states of the world (perception) and inferring a course of action (planning). While it is tempting to equate Bayes optimality with exact Bayesian inference, exact inference is generally computationally intractable or even infeasible. In cognitive psychology and artificial intelligence applications, it is common to consider bounded forms of inference and rationality. We highlighted some examples in chapter 3. Under a Bayesian framework, this translates into using approximate inference. These methods comprise sampling methods and variational methods—on which Active Inference is based. In this section, we recap the basic elements of Bayesian inference and its variational manifestations (Beal 2003, Wainwright and Jordan 2008). In doing so, we hope to provide some intuition for the role of free energy and to emphasize the importance of generative models in drawing inferences about the world.

This chapter is more technical than chapters 1-3, appealing to a little linear algebra, differentiation, and the Taylor series expansion. Those readers interested in the details or in need of a refresher may turn to the appendices for the requisite background. Those who do not want to delve into the theoretical underpinnings may skip this chapter. Throughout, we explain the key implications of each equation—so it should be possible to develop an understanding of the important conceptual points herein even without following the formal argument.

A good place to start is Bayes’ theorem. Recall from chapter 2 that this theorem expresses an equality between the product of a prior and a likelihood and the product of a posterior and a marginal likelihood. This is reproduced in equation 4.1:

\[
\begin{aligned}
P(x \mid y)\,P(y) &= P(y \mid x)\,P(x)\\
P(y) &= \sum_x P(y \mid x)\,P(x)
\end{aligned}
\tag{4.1}
\]
The first line of equation 4.1 is Bayes’ theorem. The second line shows that the marginal likelihood (or model evidence), P(y), can be computed directly from the prior and likelihood. This makes the point that the prior and likelihood—which together comprise the generative model—are sufficient for us to compute the model evidence and the posterior probability. Despite this, it is not always easy to do so. The summation (or integration, if dealing with continuous variables) in equation 4.1 can be computationally or analytically intractable. One way to resolve this—the starting point of variational inference—is to convert this potentially difficult integration problem into an optimization problem. To understand how this works, we need to appeal to Jensen’s inequality, which says that the log of an average is always greater than or equal to the average of a log.

Figure 4.1 Logarithmic function providing intuition for Jensen’s inequality.

Figure 4.1 provides a graphical intuition for why this is the case.
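A quick numerical check in Python (with arbitrary data points) makes the inequality tangible:

import numpy as np

x = np.array([0.5, 2.0, 8.0])
print(np.log(x.mean()))     # log of the average: ~1.25
print(np.log(x).mean())     # average of the logs: ~0.69

Because the logarithm is concave, the first quantity can never be smaller than the second.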

To take advantage of this property, we can rewrite equation 4.1 by multiplying the term inside the sum on the second line by an arbitrary function (Q) divided by itself (this is equivalent to multiplying by one, so the equality still holds) and taking the log of each side. Mathematically, this changes nothing. However, we can now interpret the expression as an expectation (E) of a ratio between two probabilities and so exploit Jensen’s inequality:

\[
\ln P(y) = \ln \sum_x Q(x)\,\frac{P(y, x)}{Q(x)} = \ln \mathbb{E}_{Q(x)}\!\left[\frac{P(y, x)}{Q(x)}\right] \;\geq\; \mathbb{E}_{Q(x)}\!\left[\ln \frac{P(y, x)}{Q(x)}\right]
\tag{4.2}
\]
The second line of this equation uses the fact that we have a log of an expectation and that, by Jensen’s inequality, this must always be greater than or equal to the expectation of the log. This move is sometimes referred to as importance sampling. The right-hand side of this inequality is known as the (negative) variational free energy: the smaller the free energy, the closer it is to the negative log model evidence. With this in mind, we can rewrite Bayes’ theorem (equation 4.1) in logarithmic form, take its average under the posterior distribution, and disclose the relationship between this and the quantities of equation 4.2:

\[
\begin{aligned}
\ln P(y) &= \ln P(y, x) - \ln P(x \mid y)\\
&= \mathbb{E}_{P(x \mid y)}[\ln P(y, x) - \ln P(x \mid y)]
\end{aligned}
\tag{4.3}
\]
The second line follows from the fact that the log probability of y is not a function of x, so taking an expectation under the posterior distribution does not change this quantity. Equation 4.3 provides some intuition for the roles of the free energy and the Q distribution—the two quantities that were difficult to compute without the variational approximation. The former plays the role of the negative log model evidence, while the latter acts as if it were the posterior probability. More formally, we can rearrange the free energy as we did in chapter 2 to quantify the relationship between free energy and model evidence:

\[
\begin{aligned}
F[Q, y] &= D_{KL}[Q(x) \,\|\, P(x \mid y)] - \ln P(y)\\
D_{KL}[Q(x) \,\|\, P(x \mid y)] &= \mathbb{E}_{Q(x)}[\ln Q(x) - \ln P(x \mid y)]
\end{aligned}
\tag{4.4}
\]
The first line of equation 4.4 shows the free energy expressed in terms of a KL-Divergence and a negative log evidence. The KL-Divergence is defined in the second line as the expected difference between two log probabilities. This is often used as a measure of how different two probability distributions are from one another.

Sometimes, the use of free energy is motivated directly in terms of this divergence. The argument goes that if our aim is to perform approximate Bayesian inference, we need to find an approximate posterior that best matches the exact posterior. As such, we can select a measure of the divergence between the two—of which the KL-Divergence in equation 4.4 is one example—and minimize this. As we do not know the exact posterior, we cannot use this divergence directly. One solution is to add the log evidence term, which may be combined with the log posterior to form the joint probability (which we do know because this is the generative model). The result is the free energy.
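Spelling this construction out in the notation of equation 4.4, adding the log evidence to the divergence replaces the unknown posterior with the known joint distribution:

\[
D_{KL}[Q(x) \,\|\, P(x \mid y)] - \ln P(y)
= \mathbb{E}_{Q(x)}[\ln Q(x) - \ln P(x \mid y) - \ln P(y)]
= \mathbb{E}_{Q(x)}[\ln Q(x) - \ln P(y, x)]
= F[Q, y]
\]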

An interesting consequence of this perspective is that there is some ambiguity over which divergence measure to use. If we want to make the approximate and exact posterior as close as possible, we could use the other KL-Divergence, where Q and P are swapped, or choose from a large family of divergences, each of which emphasizes different aspects of the difference between distributions. However, the ideas set out in chapter 3 highlight the importance of self-evidencing for systems engaging in Active Inference. Therefore, we are primarily looking for a tractable evidence maximization scheme and only secondarily looking to minimize the divergence. From this perspective, there is no ambiguity as to which divergence measure to use. This emerges from the use of Jensen’s inequality.

4.3 Generative Models

To calculate the free energy, we need three things: data, a family of variational distributions, and a generative model (comprising a prior and a likelihood). In this section, we outline two very general sorts of generative model used for Active Inference and the form the free energy takes in relation to each. The first deals with inferences about categorical variables (e.g., object identity) and is formulated as a sequence of events. The second deals with inferences about continuous variables (e.g., luminance contrast) and is formulated in continuous time using stochastic differential equations. Before specifying the details of these models, we review a graphical formalism that expresses the dependencies implied by a generative model.

Figure 4.2 shows several examples of generative models expressed as factor graphs, chosen to provide some intuition for the sorts of things that may be articulated in this way. These represent the factors (e.g., prior and likelihood) of a generative model as squares and the variables in that model (hidden states or data) as circles. Arrows indicate the direction of causality between these variables. The upper-left graph shows the simplest form these models can take, with a hidden state (x) causing data (y). The prior in this model is shown as factor 1, and the likelihood is factor 2. The other graphs extend this idea by introducing additional variables. In the upper right, z plays the role of a second hidden state, so that y depends on the states of both x and z.

As an example, consider a clinical diagnostic test. In this setting, the simple graph in the upper left can be interpreted as the presence or absence of a disease (x) and the result of the test (y).

Figure 4.2 Dependencies between variables in a (graphical) probabilistic model. The circles represent random variables (i.e., the things about which we hold beliefs); the squares represent the probability distributions that describe the relationships between these variables. An arrow from one circle to another via a square indicates that the variable in the second circle depends on that in the first circle and that this dependency is captured in the probability distribution represented by the square.

The prior is then the prevalence of the disease, while the likelihood specifies the properties of the test. These include its specificity (the probability of a negative result in the absence of the disease) and sensitivity (the probability of a positive result in the presence of the disease). We can then think of the model in terms of the mechanism by which a test result is obtained—going from the top to the bottom of the factor graph. First, we sample a person from a population with known prevalence of a disease. If they have the disease, they will generate a true positive test result with probability given by the test sensitivity, and a false negative otherwise. If they do not have the disease, they will generate a true negative with probability given by the specificity, and a false positive otherwise.
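A worked version of this example, with invented prevalence and test characteristics, shows how the generative model yields a posterior via Bayes’ theorem:

prevalence = 0.01     # prior P(x = disease)
sensitivity = 0.90    # P(y = positive | disease)
specificity = 0.95    # P(y = negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive
print(posterior)      # ~0.154

Even with a fairly accurate test, the low prevalence keeps the posterior probability of disease near 15 percent, which illustrates how strongly the prior shapes the inference.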

Pursuing the same example, we can interpret the other factor graphs. In the upper-right panel, x and z could be the presence or absence of two different diseases, either of which could give a positive test result. In the lower left, w plays the role of data. Both y and w are generated by x and could represent (for example) two different diagnostic tests that are informative about the same disease process. Finally, the lower-right graph treats both x and v as hidden states but introduces a hierarchical structure in which v causes x causes y. Here we could think of v as providing a context or a predisposing factor (e.g., genetic polymorphism) for the presence or absence of disease x, which may be tested for by measuring y. In principle, we can add an arbitrary number of variables to this hierarchy.

Generative models of this sort are often used for static perceptual tasks, such as object recognition or cue integration. The generative models used for active inference differ in an important way: they evolve over time as new observations are sampled, and the observations that are added depend (via action) on beliefs about variables in the model. This has two key implications. First, the conditional dependencies include the dependencies of hidden variables at a given time on those at previous times. Second, these models sometimes include hypotheses about “how I am acting” as hidden variables.

Figure 4.3 illustrates the two basic forms of dynamic generative model used in Active Inference (Friston, Parr, and de Vries 2017) in factor graph form (Loeliger 2004, Loeliger et al. 2007). The upper graph shows a Partially Observable Markov Decision Process (POMDP), which expresses a model in which a sequence of states (s) evolves over time. At each time step, the current state is conditionally dependent on the state at the previous time and on the policy (π) currently being pursued. Policies here may be thought of as indexing alternative trajectories, or sequences of actions, that could be followed. Each time-point is associated with an observation (o) that depends only on the state at that time. This sort of model is very useful in dealing with sequential planning tasks—for example, navigating a maze (Kaplan and Friston 2018)—or decision-making processes that involve selecting between alternatives (e.g., categorization of a scene [Mirza et al. 2016]).

The lower graph in figure 4.3 shows a very similar graphical model but expressed in continuous time. In place of representing a trajectory as a series of states, this model represents the current position, velocity, and acceleration (and successive temporal derivatives) of a state (x). These values (referred to as generalized coordinates of motion) can be used to reconstruct a trajectory using a Taylor series expansion (see appendix A for an introduction to Taylor series approximations in this context). The relationship between a state and its temporal derivative here depends on (slowly varying) causes (v) that play a similar role to the policies above. As before, states generate observations (y).

Figure 4.3 Two dynamic generative models (using the same graphical notation as in figure 4.2) that we will appeal to throughout the remainder of this book. Top: Partially Observable Markov Decision Process (POMDP), defined in terms of a sequence of states evolving through time (indexed by the subscript). Bottom: Continuous-time model, of the sort implied by stochastic differential equations (with the prime notation indicating temporal derivatives).

The difference in notation (s, π, o vs. x, v, y) is used to emphasize the difference between categorical variables that evolve in discrete time and continuous variables that evolve in continuous time. Similarly, from here on, we will use lowercase p and q for probability densities over continuous variables and uppercase P and Q for distributions over categorical variables. Sections 4.4 and 4.5 will unpack these models in more detail and will show how minimization of free energy in each case leads to a set of equations that describes the dynamics of inferential processes.

4.4 Active Inference in Discrete Time

In this section, we focus on the discrete-time model outlined above. This is important for understanding a range of cognitive processes that deal with categorical inferences and selection between alternative hypotheses. This formalism additionally facilitates an examination of the classic exploitation-exploration problem and illustrates how Active Inference resolves this.

4.4.1 Partially Observable Markov Decision Processes

As shown in figure 4.3, a POMDP expresses the evolution over time of a sequence of hidden states that depend on a policy. To specify this process formally, we need to account for the form of each of the square factor nodes in the figure. First, we describe each of these factors. We then combine them to express the joint distribution that constitutes the generative model.

As with the simple example of Bayes’ rule given in chapter 2, we can separate the factors into those representing a likelihood and those combining to make a prior. The likelihood is similar to that used before and expresses the probability of an outcome (observable) given a state (hidden). If both the outcomes and states are categorical variables, the likelihood is a categorical distribution, parameterized by a matrix, A:

\[
\begin{aligned}
P(o_\tau \mid s_\tau) &= \mathrm{Cat}(\mathbf{A})\\
P(o_\tau = i \mid s_\tau = j) &= \mathbf{A}_{ij}
\end{aligned}
\tag{4.5}
\]
The second line here details what is meant by the Cat notation (i.e., specification of a categorical distribution). This accounts for the nodes labeled “2” in figure 4.3. The prior over the sequence (expressed using the ~ symbol) of hidden states depends on two things: the prior over the initial state (specified by a vector, D) and beliefs about how the state at one time transitions to that at the next (specified as a matrix, B):

\[
\begin{aligned}
P(\tilde{s} \mid \pi) &= P(s_1)\prod_{\tau=2}^{T} P(s_\tau \mid s_{\tau-1}, \pi)\\
P(s_1) &= \mathrm{Cat}(\mathbf{D})\\
P(s_\tau \mid s_{\tau-1}, \pi) &= \mathrm{Cat}(\mathbf{B}_{\pi})
\end{aligned}
\tag{4.6}
\]
Together, these account for the “3” nodes in figure 4.3. Note that the transitions are conditionally dependent on the policy chosen. Thus, we can interpret the priors of equation 4.6, combined with the likelihood of equation 4.5, as expressing a model (π) of a behavioral sequence. To allow us to select between these models (i.e., to form a plan), we need a prior belief about the most probable sequence. For a free energy minimizing creature, a self-consistent prior is that the most probable policies are those that will lead to the lowest expected free energy (G) in the future:

\[
\begin{aligned}
P(\pi) &= \mathrm{Cat}(\boldsymbol{\pi}_0)\\
\boldsymbol{\pi}_0 &= \sigma(-\mathbf{G})\\
G(\pi) &= -\mathbb{E}_{Q(\tilde{o}, \tilde{s} \mid \pi)}\big[\ln P(\tilde{o} \mid C) + \ln Q(\tilde{s} \mid \tilde{o}, \pi) - \ln Q(\tilde{s} \mid \pi)\big]\\
Q(\tilde{o}, \tilde{s} \mid \pi) &= P(\tilde{o} \mid \tilde{s})\,Q(\tilde{s} \mid \pi)
\end{aligned}
\tag{4.7}
\]
This equation, being of fundamental importance to Active Inference, is worth unpacking in more depth. The first two lines express the prior probability of each policy, as parameterized by π₀, as being related to the negative expected free energy associated with that policy. The softmax function (σ) enforces normalization (i.e., ensures that the probabilities of policies sum to one). The final two lines of equation 4.7 express the form of the expected free energy.

Note the similarity between this and the functional form of the free energy (equation 4.4)—with a log probability of outcomes and a KL-Divergence. The key difference here is that the expectation is taken with respect to the posterior predictive density as defined by the final equality. This distribution expresses a joint probability over future states and observations. Crucially, this means we can compute the expected free energy in the future—something we could not do with the variational free energy, which depends on (present and past) observations. In addition, note that the distribution over outcomes depends on parameters (C) and that the sign of the KL-Divergence is reversed, which is a consequence of the expectation under the posterior predictive probability. This last point can cause some confusion, so it is worth spelling out explicitly why this is. In the context of the variational free energy, the KL-Divergence was the expected difference between the log probability of the approximate posterior and the log probability of the exact posterior (equation 4.4). The analogous term in the expected free energy is the expected difference between the approximate posterior and the exact posterior we would get on the basis of the entire trajectory of outcomes, using current posterior beliefs as if they were priors. Unpacking this, we get the following:

\[
\mathbb{E}_{Q(\tilde{o}, \tilde{s} \mid \pi)}\big[\ln Q(\tilde{s} \mid \pi) - \ln Q(\tilde{s} \mid \tilde{o}, \pi)\big] = -\mathbb{E}_{Q(\tilde{o} \mid \pi)}\Big[D_{KL}\big[Q(\tilde{s} \mid \tilde{o}, \pi) \,\|\, Q(\tilde{s} \mid \pi)\big]\Big]
\tag{4.8}
\]
Here we see that the order in which we take expectations is important. It prompts a reversal in sign relative to the analogous term in the variational free energy. This underwrites an important difference between the two quantities. The expected free energy is minimized by selecting those observations that cause a large change in beliefs, in contrast to the variational free energy, which is minimized when observations comply with current beliefs. This is the difference between optimizing beliefs in relation to data that have already been gathered (variational free energy minimization) and selecting those data that will best optimize beliefs (expected free energy minimization).

This reiterates that Active Inference uses two constructs, variational free energy (F) and expected free energy (G), which are mathematically related but play distinct and complementary roles. Variational free energy is the primary quantity that is minimized over time. It is optimized in relation to a generative model, which can include policies (or action sequences). As with all other hidden states, the agent needs to assign a prior probability to policies—because policies are just another random variable in the generative model. Active Inference uses a prior that is (loosely speaking) equivalent to the belief that one will minimize free energy in the future: that is, the expected free energy. In other words, expected free energy furnishes a prior over policies and is therefore a prerequisite in minimizing variational free energy.

In chapter 2 we saw that, as with the variational free energy, the expected free energy can be rearranged in a number of ways to disclose various interpretations. Here, we focus on an interpretation in terms of the risk and the ambiguity associated with a policy. This is equivalent to the expression in equation 4.7:

\[
\begin{aligned}
G(\pi) &= -\underbrace{\mathbb{E}_{Q(\tilde{o}, \tilde{s} \mid \pi)}\big[\ln Q(\tilde{s} \mid \tilde{o}, \pi) - \ln Q(\tilde{s} \mid \pi)\big]}_{\text{information gain}} \;-\; \underbrace{\mathbb{E}_{Q(\tilde{o} \mid \pi)}\big[\ln P(\tilde{o} \mid C)\big]}_{\text{pragmatic value}}\\
&= \underbrace{D_{KL}\big[Q(\tilde{o} \mid \pi) \,\|\, P(\tilde{o} \mid C)\big]}_{\text{risk}} \;+\; \underbrace{\mathbb{E}_{Q(\tilde{s} \mid \pi)} H\big[P(\tilde{o} \mid \tilde{s})\big]}_{\text{ambiguity}}
\end{aligned}
\tag{4.9}
\]
Recall from chapter 2 that the first of these expresses the trade-off between seeking new information (i.e., exploration) and seeking preferred observations (i.e., exploitation). By minimizing expected free energy, the relative balance between these terms determines whether behavior is predominantly explorative or exploitative. Note that pragmatic value emerges as a prior belief about observations, where the C-parameters of this distribution may be chosen to reflect the sort of system we are interested in characterizing (in terms of its characteristic or preferred outcome states). Following the second line of equation 4.9, we can rewrite equation 4.7 in linear algebraic form as follows:

\[
\begin{aligned}
\boldsymbol{\pi}_0 &= \sigma(-\mathbf{G})\\
\mathbf{G}_\pi &= \textstyle\sum_\tau \mathbf{o}_{\pi\tau} \cdot \boldsymbol{\varsigma}_{\pi\tau} + \mathbf{s}_{\pi\tau} \cdot \mathbf{H}\\
\boldsymbol{\varsigma}_{\pi\tau} &= \ln \mathbf{o}_{\pi\tau} - \ln \mathbf{C}\\
\mathbf{H} &= -\mathrm{diag}\big(\mathbf{A}^{\mathrm{T}} \ln \mathbf{A}\big)\\
P(o_\tau) &= \mathrm{Cat}(\mathbf{C})\\
\mathbf{o}_{\pi\tau} &= \mathbf{A}\,\mathbf{s}_{\pi\tau}\\
Q(o_\tau \mid \pi) &= \mathrm{Cat}(\mathbf{o}_{\pi\tau})\\
Q(s_\tau \mid \pi) &= \mathrm{Cat}(\mathbf{s}_{\pi\tau})
\end{aligned}
\tag{4.10}
\]
The first line of equation 4.10 uses a softmax (normalized exponential) operator to construct a probability distribution (parameterized by the sufficient statistics π₀) that sums to one from the expected free energy vector. Lines two to four express the components of the expected free energy in linear algebraic notation. The fifth line shows that the prior belief about observations is a categorical distribution (whose sufficient statistics are given in the C vector). The sixth to eighth lines specify the relationship between the linear algebraic quantities and the associated probability distributions. Having completed the specification of the generative model, we can now express the free energy in terms of the variables above:

\[
F(\pi) = \sum_\tau \mathbf{s}_{\pi\tau} \cdot \big(\ln \mathbf{s}_{\pi\tau} - \ln(\mathbf{B}_{\pi}\,\mathbf{s}_{\pi,\tau-1}) - (\ln \mathbf{A})^{\mathrm{T}} \mathbf{o}_\tau\big)
\tag{4.11}
\]

(with the convention that the transition term for the first time step is given by the prior, D).
The decomposition of this into a sum over time is due to the implicit mean-field approximation that assumes we can factorize the approximate posterior into a product of factors:

\[
Q(\tilde{s} \mid \pi) = \prod_\tau Q(s_\tau \mid \pi), \qquad Q(s_\tau \mid \pi) = \mathrm{Cat}(\mathbf{s}_{\pi\tau})
\tag{4.12}
\]
In logarithmic form, this becomes a sum, just as in equation 4.11. This factorization is one of many possibilities in variational inference—and represents the simplest option. In practice, this is often nuanced slightly, as detailed in appendix B.
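To make the linear algebraic form of equation 4.10 concrete, the following sketch (toy matrices, not taken from any published simulation) computes the expected free energy of a single time step under one policy as risk plus ambiguity:

import numpy as np

A = np.array([[0.9, 0.2],        # likelihood P(o | s); columns sum to one
              [0.1, 0.8]])
C = np.array([0.8, 0.2])         # prior preference over outcomes
s = np.array([0.5, 0.5])         # predicted states under the policy

o = A @ s                                      # predicted outcomes
risk = np.sum(o * (np.log(o) - np.log(C)))     # KL[Q(o | pi) || P(o | C)]
H = -np.diag(A.T @ np.log(A))                  # conditional entropy per state
ambiguity = s @ H
print(risk + ambiguity)                        # expected free energy G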

4.4.2 Active Inference in a POMDP

Hitherto, we have defined the four key ingredients for a discrete-time generative model. These are the likelihood (A), transition probabilities (B), prior beliefs about observations (C), and prior beliefs about the initial state (D). Once these probability distributions are specified, a generic message passing scheme can be employed to minimize free energy and solve the POMDP. To make inferences about hidden states under a given policy, we set the rate of change of an auxiliary variable (v), which stands in for the log posterior over states (s), to be equal to the negative free energy gradient; a softmax (normalized exponential) function is then used to compute s from v:

\[
\begin{aligned}
\dot{\mathbf{v}}_{\pi\tau} &= -\nabla_{\mathbf{s}_{\pi\tau}} F(\pi)\\
\mathbf{s}_{\pi\tau} &= \sigma(\mathbf{v}_{\pi\tau})
\end{aligned}
\tag{4.13}
\]
Equation 4.13 can be regarded as an example of variational message passing (see box 4.1). To update beliefs about policies, we find the posterior that minimizes the free energy:

\[
\boldsymbol{\pi} = \sigma(-\mathbf{G} - \mathbf{F})
\tag{4.14}
\]

Here, G and F are vectors whose elements are the expected and variational free energies evaluated under each policy.
For the simplest form of POMDP, equations 4.13 and 4.14 can be used to solve an Active Inference problem for any set of probability matrices; these may be thought of as describing perception and planning, respectively. We will unpack this in greater detail in the second part of the book, where we will provide worked examples of Active Inference for perception and planning (and other cognitive functions).
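The following self-contained sketch (toy values throughout, not the book’s reference code) implements both updates for a single time step: a gradient flow on free energy for perception, as in equation 4.13, and a softmax over negative free energies for planning, as in equation 4.14:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

A = np.array([[0.9, 0.2],        # likelihood P(o | s)
              [0.1, 0.8]])
D = np.array([0.5, 0.5])         # prior over the initial state
o = np.array([1.0, 0.0])         # observed outcome, one-hot

# Perception: descend the free energy gradient on an auxiliary variable v;
# s = softmax(v) remains a proper probability distribution throughout.
ln_lik = np.log(A).T @ o         # ln P(o | s) as a vector over states
v = np.log(D)
for _ in range(32):
    s = softmax(v)
    v = v - 0.25 * (np.log(s) - np.log(D) - ln_lik)
print(softmax(v))                # approximates the exact posterior (~[0.82, 0.18])

# Planning: combine expected (G) and variational (F) free energies
# of each policy into a posterior over policies.
G = np.array([1.4, 0.9])
F = np.array([0.6, 0.8])
print(softmax(-G - F))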

Figure 4.4’s graphical representations of equations 4.10, 4.13, and 4.14 hint at possible neuronal implementations of free energy minimization in the brain—if one interprets nodes as neuronal populations, edges as synapses, and messages as synaptic exchanges. In later chapters we will consider the extension of this to factorized state-spaces, deep temporal models, and the optimization of the parameters of the generative model itself (learning).

4.5 Active Inference in Continuous Time

In the previous section, we dealt with the form Active Inference takes under a particular choice of generative model. These POMDPs are a useful way to articulate a range of inference problems, including those that underwrite planning and decision-making. However, when it comes to interacting with a real environment, models described in discrete time with categorical variables fall short. This is because sensory input and motor outputs are continuously evolving variables. To account for this, we now turn to a different sort of generative model. We apply exactly the same idea, a gradient descent on variational free energy, to these models to find the analogous message passing schemes.

Box 4.1 Message passing and inference

Markov blankets

We encountered the concept of a Markov blanket in chapter 3. However, it is worth briefly reviewing the idea here. It relates to a system of multiple interacting variables. A Markov blanket for a given variable comprises a subset of those that interact with it. If we know everything about this subset, knowledge of anything outside this subset does not increase our knowledge of the variable of interest. The relevance here is that we can draw inferences about a variable in a graphical model based on local information about its Markov blanket. The blanket of a variable x comprises those variables that cause x (its parents, p(x)), the variables that are caused by x (its children, c(x)), and the other parents of x’s children. Using this notation, two of the most common Bayesian message passing schemes used for approximate inference are defined as follows:
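As an aside, this definition is straightforward to operationalize. A hypothetical helper (names invented) that returns the blanket of a node in a directed acyclic graph:

def markov_blanket(node, parents):
    # parents maps each node to the set of its parents
    children = {n for n, ps in parents.items() if node in ps}
    coparents = set().union(*(parents[c] for c in children)) if children else set()
    return (parents[node] | children | coparents) - {node}

# Example graph: v -> x -> y and z -> y
graph = {"v": set(), "x": {"v"}, "z": set(), "y": {"x", "z"}}
print(markov_blanket("x", graph))   # {'v', 'y', 'z'} (set order may vary)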

Variational message passing

This involves messages from all constituents of the Markov blanket of x, including the parents (via the conditional probability of x given its parents) and the children. The latter depends on the conditional probability of the children of x given all of their parents—which include x. Note the expectation includes the children and parents of the children. As the parents of the children include x, we divide by Q(x) to ensure the expectation includes the blanket only.

Belief propagation

This has broadly the same structure as variational message passing but uses a recursive definition of messages such that each message (μ_a(b) being the message to b from a) depends on other messages (the messages to a). There is a directional aspect to this, such that the message from a to b depends on all messages to a, except for that from b (hence the division in the expectations). NB: The slightly nonstandard use of the expectation operator here allows us to (1) cover both discrete and continuous variables and (2) highlight the formal similarity between variational message passing and belief propagation.

Figure 4.4 Bayesian message passing. Right: Dependencies between different variables in the belief-updating scheme outlined in the main text. Intuitively, current beliefs about states (under each policy) at each time are compared with those that would be predicted given beliefs about states at other times (1) and current outcomes to calculate prediction errors. These errors then drive updating in these beliefs (2); given beliefs about states under each policy, we can then calculate the gradients of the expected free energy (3). These are combined with the outcomes predicted under each policy (omitted from the figure) to compute beliefs about policies (4). Using a Bayesian model average, we can then compute posterior beliefs about states averaged over policies (5). This high-level summary of message passing omits some intermediate connections that could be included (e.g., connection (4) could be unpacked to explicitly include computation of the expected free energy). Left: This scheme could be expanded hierarchically (collapsing over time steps and policies for simplicity). The key idea is that a higher-level network might predict the states and policies at the lower level and use these to draw inferences about the context in which these occur. We will unpack this idea further in chapter 7.


4.5.1 A Generative Model for Predictive Coding

To motivate the form of generative model used for continuous states, we start with the following pair of equations:

\[
\begin{aligned}
\dot{x} &= f(x, v) + \omega_x\\
y &= g(x, v) + \omega_y
\end{aligned}
\]
The first of these expresses the evolution of a hidden state over time, according to a deterministic function (f(x, v)) and stochastic fluctuations (ω). The second equation expresses the way in which data are generated from the hidden state. In each case, the fluctuations are assumed to be normally distributed, giving the following probability densities for the dynamics and likelihood:

\[
\begin{aligned}
p(\dot{x} \mid x, v) &= \mathcal{N}\big(f(x, v),\, \Pi_x^{-1}\big)\\
p(y \mid x, v) &= \mathcal{N}\big(g(x, v),\, \Pi_y^{-1}\big)
\end{aligned}
\]
The precision (Π) terms are the inverse covariances of the fluctuations.
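To ground these densities, here is a minimal Euler-Maruyama simulation of such a generative process (the functions f and g and all parameters are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x, v: v - x          # flow: the state relaxes toward its cause
g = lambda x, v: x              # data are a noisy copy of the state

dt, v, x = 0.01, 1.0, 0.0
xs, ys = [], []
for _ in range(1000):
    x = x + f(x, v) * dt + np.sqrt(dt) * 0.1 * rng.standard_normal()
    xs.append(x)
    ys.append(g(x, v) + 0.1 * rng.standard_normal())
# xs drifts toward v = 1.0; ys scatters around it with observation noise

Note that the fluctuations here are uncorrelated from step to step, which is precisely the Wiener assumption that the generalized coordinates discussed next are designed to relax.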

These two equations form the generative model that underwrites Kalman-Bucy filters in engineering. However, schemes of this sort are limited by the assumption of uncorrelated fluctuations over time (i.e., Wiener assumptions). This is inappropriate for inference in biological systems, where fluctuations are themselves generated by dynamical systems and have a degree of smoothness. We can account for this by considering not only the rate of change of the hidden state and the current value of the data but also their velocities, accelerations, and subsequent temporal derivatives—that is, generalized coordinates of motion (Friston, Stephan et al. 2010; see box 4.2):

6 A Recipe for Designing Active Inference Models

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.

—Abraham Lincoln

6.1 Introduction

This chapter provides a four-step recipe to construct an Active Inference model, discussing the most important design choices one has to make to realize a model and providing some guidelines for those choices. It serves as an introduction to the second part of the book, which will illustrate several specific computational models using Active Inference and their applications in a variety of cognitive domains.

As Active Inference is a normative approach, it tries to explain as much as possible about behavioral, cognitive, and neural processes from first principles. Accordingly, the design philosophy of Active Inference is top-down. Unlike many other approaches to computational neuroscience, the challenge is not to emulate a brain, piece by piece, but to find the generative model that describes the problem the brain is trying to solve. Once the problem is appropriately formalized in terms of a generative model, the solution to the problem emerges under Active Inference—with accompanying predictions about brains and minds. In other words, the generative model provides a complete description of a system of interest. The resulting behavior, inference, and neural dynamics can all be derived from the model by minimizing free energy.

The generative modeling approach is used in several disciplines for the realization of cognitive models, statistical modeling, experimental data analysis, and machine learning (Hinton 2007b; Lee and Wagenmakers 2014; Pezzulo, Rigoli, and Friston 2015; Allen et al. 2019; Foster 2019). Here, we are primarily interested in designing generative models that engender cognitive processes of interest. We have seen this design methodology in previous chapters. For example, using a generative model for predictive coding, perception was cast as an inference about the most likely cause of sensations; using a generative model that evolves in discrete time, planning was cast as an inference about the most likely course of action. Depending on the problem of interest (e.g., planning during spatial navigation or planning saccades during visual search), one can adapt the form of these generative models to equip them with different structures (e.g., shallow or hierarchical) and variables (e.g., beliefs about allocentric or egocentric spatial locations). Importantly, Active Inference may take on many different guises under different assumptions about the form of the generative model being optimized. For example, assumptions about models that evolve in discrete or continuous time influence the form of the message passing (see chapter 4). This implies that the choice of a generative model corresponds to specific predictions about both behavior and neurobiology.

This flexibility is useful as it allows us to use the same language to describe processes in multiple domains. However, it can also be confusing from a practical perspective, as there are a number of choices that must be made to find the appropriate level of description for the system of interest. In the second part of this book, we will try to resolve this confusion through a series of illustrative examples of Active Inference in silico. This chapter introduces a general recipe for the design of Active Inference models, highlighting some of the key design choices, distinctions, and dichotomies that will appear in the numerical analysis of computational models described in subsequent chapters.

6.2 Designing an Active Inference Model: A Recipe in Four Steps

Designing an Active Inference model requires four foundational steps, each resolving a specific design question:

  1. Which system are we modeling? The first choice to make is always the system of interest. This may not be as simple as it seems; it rests on the identification of the boundaries (i.e., Markov blanket) of that system. What counts as an Active Inference agent (generative model), what counts as the external environment (generative process), and what is the interface (sensory data and actions) between them?

  2. What is the most appropriate form for the generative model? The first of the next three practical challenges is deciding whether it is appropriate to think of a process more in terms of categorical (discrete) inferences or continuous inferences, motivating the choice between discrete or continuous-time implementations (or a hybrid) of Active Inference. Then we need to select the most appropriate hierarchical depth, motivating the choice between shallow versus deep models. Finally, we need to consider whether it is necessary to endow generative models with temporal depth and the ability to predict action-contingent observations to support planning.

  3. How to set up the generative model? What are the generative model’s most appropriate variables and priors? Which parts are fixed and what must be learned? We emphasize the importance of choosing the right sort of variables and prior beliefs; furthermore, we emphasize a separation in timescales between the (faster) update of state variables that occurs during inference and the (slower) update of model parameters that occurs during learning.

  4. How to set up the generative process? What are the elements of the generative process (and how do they differ from the generative model)?

    These four steps (in most cases) suffice to design an Active Inference model. Once completed, the behavior of the system is determined by the standard schemes of Active Inference: the descent of the active and internal states on the free energy functional associated with the model. From a more practical perspective, once one has specified the generative model and generative process, one can use standard Active Inference software routines to obtain numerical results, as well as to perform data visualization, analysis, and fitting (e.g., model-based data analysis). In what follows, we will review the four design choices in order.
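Before unpacking each step, it may help to see the recipe as a whole. The sketch below (in Python) renders the four design questions as a single specification that a modeler might fill in before handing the problem to standard routines; every field name here is hypothetical and purely organizational, not part of any Active Inference software package.

# Hypothetical scaffold: the four design steps as one specification.
# All field names are illustrative only.

model_spec = {
    # Step 1: which system? (where the Markov blanket is drawn)
    "system": {
        "sensory_states": ["retinal input"],
        "active_states": ["oculomotor commands"],
    },
    # Step 2: form of the generative model
    "form": {
        "variables": "discrete",        # or "continuous", or "hybrid"
        "hierarchy": "shallow",         # or "deep"
        "temporal_depth": 3,            # planning horizon (0 = none)
    },
    # Step 3: content of the generative model
    "generative_model": {
        "hidden_states": ["frog", "apple"],
        "observations": ["jumps", "does not jump"],
        "learned": ["likelihood"],      # parts updated by learning
    },
    # Step 4: the generative process (the world producing observations)
    "generative_process": {"same_as_model": True},
}

print(model_spec["form"])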

6.3 What System Are We Modeling?

A useful first step in applying the formalism of Active Inference is to identify the boundaries of the system of interest because we are interested in characterizing the interaction between what is internal to a system and the external world via sensory receptors and effectors (e.g., muscles or glands). As discussed in chapter 3, a formal way to characterize the distinction between internal states of a system and external variables (and intermediate variables that mediate their interactions) is in terms of a Markov blanket (Pearl 1988). To reiterate the argument, a Markov blanket may be subdivided into two sorts of variables (Friston 2013): those that mediate the influence of the external world on internal states of the system of interest (i.e., sensory states) and those that mediate the influence of internal states of the system of interest on the external world (i.e., active states). See figure 6.1. Importantly, there are many ways in which a boundary between internal and external may be defined. In most of the simulations we will discuss in the second part of this book, there will be a (Markov blanket) separation between an agent (roughly, a living organism) and its environment. This corresponds to the usual setup of cognitive models, where an agent implements cognitive processes such as perception and action selection on the basis of its internal (e.g., brain) states and is provided with sensors and effectors.

Figure 6.1 Action-perception loop between an adaptive system (here, the brain) and the environment, along with the Markov blanket (composed of active states and sensory states) that mediates their interaction. The figure implies that the adaptive system only affects the environment by performing actions (via active states) and that the environment only affects the adaptive system by producing observations (via sensory states). The figure exemplifies the distinction between the adaptive system’s generative model and the (external) generative process that produces its observations.

However, this is not the only possibility. From the perspective of neurobiology, we could draw a Markov blanket around a single neuron, around the brain, or around the entire body. In the first case, sensory states include postsynaptic receptor occupancies, and active states include the rate at which vesicles containing neurotransmitters fuse with the presynaptic membrane. The internal states of the neuron (e.g., membrane potentials, calcium concentrations) can then be thought of as inferring the causes of its sensory states according to some (implicit) generative model (Palacios, Isomura et al. 2019). This setup treats the external states (that are being modeled) as including the neuronal network in which our neuron participates. This is very different from the inference taking place when we assume our entire network is internal to the Markov blanket. For example, if we take a system whose sensory states are the photoreceptors in the retina and whose active states are the oculomotor muscles, the inferences performed by the internal states are about things outside the brain. This speaks to the importance of scale, as the internal states of this Markov blanket include the internal states from the perspective of a single neuron. The latter internal states appear to make inferences about things within the brain when the Markov blanket is drawn around a single neuron but not when the blanket is drawn around the nervous system.

The above is particularly relevant when dealing with embodied or extended perspectives on cognition (Clark and Chalmers 1998; Barsalou 2008; Pezzulo et al. 2011). For example, if we draw the blanket around the nervous system, the rest of the body becomes an external state, about which we must make inferences from interoceptive sensory states (Allen et al. 2019). Alternatively, we could draw our blanket around the entire organism. This would make it look as if organs other than the brain were making inferences about their environment. For example, depression of the skin in response to an external pressure could be framed as an inference about the source of the external pressure. The extended cognition perspective takes this further and says that objects external to the body may be incorporated into the Markov blanket (e.g., the use of a calculator to assist in inference implies that the calculator is part of the internal state-space of the inferring system). Finally, we could have multiple Markov blankets, nested within one another (e.g., brains, organisms, communities).

In sum, defining the Markov blanket ensures we know what is being inferred (external states) and what is doing the inferring. Indeed, minimization of free energy with respect to a generative model only involves the internal and active states of a system: these only see the sensory states, so they can only infer the external state of the world vicariously.

6.4 What Is the Most Appropriate Form for the Generative Model?

Once we have decided on the internal states of a system and the states that mediate their interaction with the world outside, we need to specify the generative model that explains how external states influence sensory states. As discussed in previous chapters, Active Inference can operate on different kinds of generative models. Therefore, we need to specify the most appropriate form of the generative model for the problem at hand. This implies making three main design choices. The first is a choice between models that include continuous or discrete variables (or both). The second is a choice between shallow models, in which inference operates on a single timescale (i.e., all variables evolve at the same timescale), and hierarchical or deep models, in which inference operates on multiple timescales (i.e., different variables evolve at different timescales). The third is a choice between models that only consider present observations versus models having some temporal depth, which consider the consequences of actions or plans.

6.4.1 Discrete or Continuous Variables (or Both)?

The first design choice is to consider whether generative models that use discrete or continuous variables are more appropriate. The former include object identities, alternative action plans, and discretized representations of continuous variables. These are modeled by expressing the probability—at each time step—of one variable transitioning into another. The latter include things like position, velocity, muscle length, and luminance and require a generative model expressed in terms of rates of change.
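This contrast can be made concrete with a toy example, sketched below in Python: the same kind of dynamics expressed once as a categorical transition matrix applied at each time step and once as a rate of change integrated over time. All numbers are illustrative.

import numpy as np

# Discrete: a transition matrix gives the probability, at each time step,
# of one categorical state transitioning into another (columns: current state).
B = np.array([[0.9, 0.2],   # P(next state = 0 | current state)
              [0.1, 0.8]])  # P(next state = 1 | current state)
s = np.array([1.0, 0.0])    # current belief: state 0 for certain
print(B @ s)                # belief one step ahead: [0.9, 0.1]

# Continuous: the model is instead a rate of change, e.g., a position x
# drifting toward a set point, integrated over time.
f = lambda x: 4.0 - x       # illustrative flow toward x = 4
x = 0.0
for _ in range(100):
    x += 0.05 * f(x)        # Euler step
print(round(x, 2))          # x has converged close to 4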

Computationally, the distinction between the two may not be clear-cut because a continuous variable may be discretized, and a discrete variable may be expressed through continuous variables. However, this distinction is important conceptually, as it underlies specific hypotheses about the time course (discrete or continuous) of the cognitive processes of interest. In most current implementations of Active Inference, high-level decision processes, such as the choice between alternative courses of action, are modeled using discrete variables, whereas more fine-grained perception and action dynamics are implemented using continuous variables; we will provide examples of both in chapters 7 and 8, respectively.

Furthermore, the choice between discrete and continuous variables is relevant for neurobiology. While each style of modeling appeals to free energy minimization, the message passing schemes these imply take different forms. To the extent that one considers message passing relevant for a process theory (see chapter 5), this implies that the neural dynamics that realize this minimization are different under each sort of model. Continuous schemes underwrite predictive coding—a theory of neural processing that relies on top-down predictions corrected by bottom-up prediction errors. However, the analogous process theories for discrete inferences involve messages of a different form. Finally, the two types of model may be combined such that discrete states are associated with continuous variables. This means we can specify a generative model wherein a discrete state (e.g., object identity) generates some pattern of continuous variables (e.g., luminance). We will discuss an example of a hybrid or mixed generative model that includes both discrete and continuous variables in chapter 8.

6.4.2 Timescales of Inference: Shallow versus Hierarchical Models

The second design choice concerns the timescales of Active Inference. One can select either (shallow) generative models, in which all the variables evolve at the same timescale, or (hierarchical or deep) models, which include variables that evolve at different timescales: slower for higher levels and faster for lower levels.

While many simple cognitive models only require shallow models, these are not sufficient when there is a clear separation of timescales between different aspects of a cognitive process of interest. One example of this is in language processing, in which short sequences of phonemes are contextualized by the word that is spoken and short sequences of words are contextualized by the current sentence. Crucially, the duration of the word transcends that of any one phoneme in the sequence and the duration of the sentence transcends that of any one word in the sequence. Hence, to model language processing, one can consider a hierarchical model in which sentences, words, and phonemes appear at different (higher to lower) hierarchical levels and evolve over (slower to faster) timescales that are approximately independent of one another. This is only an approximate separation, as levels must influence each other (e.g., the sentence influences the next words in the sequence; the word influences the next phonemes in the sequence).
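As a toy illustration of this separation of timescales, the following Python sketch holds a slow variable (a word) fixed while it generates, and is then inferred from, a short sequence of fast variables (phonemes). The two-word vocabulary, the phoneme codes, and the 0.9/0.1 probabilities are invented for the example.

import numpy as np

# A slow "word" state contextualizes a fast sequence of "phoneme" states.
words = {"cat": ["k", "ae", "t"],
         "cap": ["k", "ae", "p"]}
prior_over_words = {"cat": 0.5, "cap": 0.5}

observed = ["k", "ae", "t"]  # one lower-level observation per fast time step

# The slow variable is only resolved after accumulating the fast sequence:
posterior = {}
for word, phonemes in words.items():
    likelihood = np.prod([0.9 if p == o else 0.1
                          for p, o in zip(phonemes, observed)])
    posterior[word] = prior_over_words[word] * likelihood
Z = sum(posterior.values())
posterior = {word: p / Z for word, p in posterior.items()}

print(posterior)             # "cat" only dominates once the final phoneme arrives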

However, this does not mean we need to attempt to model the entire brain to develop meaningful simulations of a single level. For example, if we wanted to focus on word processing, we could address some aspects without having to deal with phoneme processing. This means we can treat input from parts of the brain drawing inferences about phonemes as providing observations from the perspective of word-processing areas. Phrasing this in terms of a Markov blanket, this typically means we treat the inferences performed by lower levels of a model as part of the sensory states of the blanket. This means we can summarize the inferences performed at the timescale of interest without having to specify the details of lower-level (faster) inferential processes—and this hierarchical factorization entails great computational benefits.

Another example is in the domain of intentional action selection, where the same goal (enter your apartment) can be active for an extended period of time and contextualizes a series of subgoals and actions (find keys, open door, enter) that are resolved at a much faster timescale. This separation of timescales, whether in the continuous or discrete domain, demands a hierarchical (deep) generative model. In neuroscience, one can assume that cortical hierarchies embed this sort of temporal separation of timescales, with slowly evolving states at higher levels and rapidly evolving states at lower levels, and that this recapitulates environmental dynamics, which also evolve at multiple timescales (e.g., during perceptual tasks like speech recognition or reading). In psychology, this sort of model is useful in reproducing hierarchical goal processing (Pezzulo, Rigoli, and Friston 2018) and working memory tasks (Parr and Friston 2017c) of the sort that rely on delay-period activity (Funahashi et al. 1989).

6.4.3 Temporal Depth of Inference and Planning

The third design choice concerns the temporal depth of inference. It is important to draw a distinction between two kinds of generative model: the first kind has temporal depth and explicitly represents the consequences of actions or action sequences (policies or plans), whereas the second lacks temporal depth and considers only present, not future, observations. These two kinds of model are exemplified in figure 4.3: the dynamic POMDP at the top and the continuous-time model at the bottom. The key difference between these two models is not that they use discrete or continuous variables, respectively, but that only the former (temporally deep) model endows creatures with the ability to plan ahead and select among possible futures.

Imagine a rodent that plans a route to a known food location in a maze. Doing this benefits from a temporally deep model, loosely equivalent to a spatial or cognitive map (Tolman 1948), which encodes contingencies between present and future locations conditioned on actions (e.g., the future location after turning right or left). The animal can use the temporally deep model to counterfactually consider multiple courses of action (e.g., series of right and left turns) and select the one expected to reach the food location.

Why is a temporally deep model required for planning? In Active Inference, planning is realized by calculating the expected free energy associated with different actions or policies and then selecting the policy that is associated with the lowest expected free energy. Expected free energy is not just a function of present observations (like variational free energy) but also a functional of future observations. The latter cannot be observed (by definition) but only predicted using a temporally deep model, which describes the ways in which actions produce future observations.
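A minimal sketch of this logic follows, using the decomposition of expected free energy into risk (divergence of predicted from preferred outcomes) and ambiguity. The two-state world, the predicted state distributions under each policy, and all numerical values are hypothetical.

import numpy as np

def kl(p, q):                            # Kullback-Leibler divergence
    return np.sum(p * (np.log(p) - np.log(q)))

C = np.array([0.9, 0.1])                 # preferred distribution over outcomes
A = np.array([[0.9, 0.2],                # likelihood P(outcome | state)
              [0.1, 0.8]])

def expected_free_energy(qs):            # qs: predicted states under a policy
    qo = A @ qs                          # predicted outcomes under the policy
    risk = kl(qo, C)                     # divergence from preferred outcomes
    H = -np.sum(A * np.log(A), axis=0)   # outcome entropy for each state
    return risk + H @ qs                 # risk plus expected ambiguity

qs_policy = {"go-left": np.array([0.9, 0.1]),
             "go-right": np.array([0.1, 0.9])}

G = np.array([expected_free_energy(qs) for qs in qs_policy.values()])
q_pi = np.exp(-G) / np.exp(-G).sum()     # softmax: lower G, more probable policy
print(dict(zip(qs_policy, np.round(q_pi, 3))))   # "go-left" is preferred

In this toy case, "go-left" is selected both because its predicted outcomes match the preferences encoded in C and because its outcomes are less ambiguous.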

When designing an Active Inference agent it is useful to consider whether it should have planning and future-oriented capacities—and, in this case, to select a temporally deep model. Furthermore, it is useful to consider planning depth—that is, how far in the future the planning process can look. Finally, one can design generative models that are both hierarchical and temporally deep, wherein planning proceeds at multiple timescales—faster at lower levels, and slower at higher levels. The decision whether to model alternative futures, contingent on policy selection, is largely tied up with the choice between discrete and continuous models because the idea of selecting between alternative futures, defined by sequences of actions, is more simply articulated using discrete-time models.

6.5 How to Set Up the Generative Model?

When we have specified our system of interest and identified the relevant forms of the generative model (e.g., continuous or discrete representation, shallow versus hierarchical structure), our next challenges are to specify the specific variables to include in the generative model and decide which of these variables remain fixed or change as an effect of learning.

6.5.1 Setting Up the Variables of the Generative Model

The variables of generative models can be either predefined or learned from data. For illustrative purposes, most models that we discuss in this book use predefined variables. When designing these models, in practice, the main challenge is deciding which hidden states, observations, and actions are most appropriate for the problem at hand. For example, the perceptual model for distinguishing frogs from apples in chapter 2 included only two hidden states (frogs, apples) and two observations (jumps, does not jump). A more sophisticated model could include additional observations (e.g., red, green) as well as actions such as touching, which produce differential sensory effects (jump or no jump) in the presence of a frog or an apple. Figure 6.2 schematically illustrates a generative model for the concept of a jumping frog. The concept is cast as a hierarchical model, where a single (multimodal or supramodal) hidden state at the center of the figure unfolds in a cascade of (unimodal) hidden states corresponding to percepts in different modalities (exteroceptive, proprioceptive, and interoceptive; see box 6.1), ultimately causing sensations in the same modalities. This arrangement corresponds to casting the jumping frog concept as the common cause of multiple sensory consequences (e.g., something green and jumping in the visual domain; a croaking sound in the auditory domain), some of which can be action-contingent (e.g., the sight of something jumping may increase on touching it). The inversion of the generative model corresponds to a perceptual inference (e.g., the presence of a jumping frog) from its observed sensory consequences (e.g., the sight of something green and jumpy), and it integrates information across multiple modalities.

Figure 6.2 (Hierarchical) generative model for the concept of a jumping frog uses a simplified notation compared to chapter 4: nodes within the dotted circle correspond to hidden states, whereas nodes at the periphery correspond to sensory observations. Beliefs about hidden states, following inversion of the model, correspond to percepts that may be tied to a sensory modality (e.g., visual percept) or may be amodal (e.g., the jumping frog). Action contingencies are represented as dashed lines. Horizontal dependencies between hidden states in different modalities, as well as temporal dependencies between hidden states (as we saw in the dynamical generative models of chapter 4), are ignored for the sake of simplicity.

Box 6.1

Varieties of sensory modalities: Exteroceptive, proprioceptive, and interoceptive

In Active Inference, a conceptual distinction is often made between three kinds of sensory modalities: exteroceptive (e.g., vision and audition), proprioceptive (e.g., the sense of joint and limb positions), and interoceptive (e.g., the sense of the internal organs of the body, such as heart and stomach). In multimodal generative models, one can often factorize parts of the model that relate to different modalities; this permits representing that (for example) saccadic movements have visual but not auditory consequences.

Importantly, the same principles of Active Inference operate across all the modalities. For example, in the same way visual processing can be described as the inference about (hidden variables about) a perceptual scene, interoceptive processing can be described as the inference about (hidden variables that report) the internal state of the body. Furthermore, motor actions that change the perceptual scene and internally directed actions that change the interoceptive state can be described in a similar way. The former engage spinal reflexes that fulfill proprioceptive predictions, whereas the latter engage autonomic reflexes that fulfill interoceptive predictions. Such interoceptive processing supports allostasis and adaptive regulation, and its dysfunctions can have psychopathological consequences (Pezzulo 2013; Seth 2013; Pezzulo and Levin 2015; Seth and Friston 2016; Allen et al. 2019).

Once these variables of interest have been established, the next exercise is to write down the full generative model. One example is the simple generative model for frogs and apples in figure 2.1, which is fully specified by prior beliefs about hidden states and a (likelihood) mapping between hidden states and observations and whose numerical values can be either specified by hand or learned from data (see 6.5.2).

Beyond this simple example, the elements that need to be specified are fully determined by the form of the selected generative model. For example, the discrete-time POMDP model shown in figure 4.3 (top) requires specifying the A, B, C, D, and E matrices; continuous schemes use analogous (although less alphabetical) elements, which will be dealt with in chapter 8. But even in these more complex cases, the exercise is not so dissimilar from above: namely, specifying prior beliefs about the variables of interest (e.g., in discrete-time implementations, about hidden states at the first time step in the D-vector and about observations in the C-matrix) and their probabilistic mappings (e.g., likelihood mapping between hidden states and observations in the A-matrix). However, in some cases, it is useful to think about factorizations of the state-space of the generative model, which avoids considering every possible combination of variables if some are unnecessary. In chapter 7, we will discuss a biologically plausible example of factorization that occurs in perceptual processing between “what” and “where” streams (Ungerleider and Haxby 1994)—namely, between variables that represent object identities and locations, respectively, which can be treated independently in the model (hence simplifying it) as they are often invariant to one another.
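As an illustration of what writing down the full model amounts to in the discrete-time case, here is a minimal specification of these elements for the frogs-and-apples example, with one step of perceptual inference. Every numerical value is invented for the sketch.

import numpy as np

# A: likelihood, P(observation | hidden state); states = [frog, apple]
A = np.array([[0.8, 0.1],   # P(jumps | state)
              [0.2, 0.9]])  # P(does not jump | state)

# B: transitions, P(next state | state); here the state never changes
B = np.eye(2)

# C: prior preferences over observations (a purely perceptual model: flat)
C = np.array([0.0, 0.0])

# D: prior belief about the initial hidden state
D = np.array([0.5, 0.5])

# E: prior over policies (a single, uninformative entry in this toy case)
E = np.array([1.0])

# One step of inference: posterior over states after observing "jumps"
o = 0                           # index of the observed outcome
posterior = A[o] * D / (A[o] @ D)
print(np.round(posterior, 2))   # [0.89, 0.11]: probably a frog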

Deciding which variables are of interest and the ways they are related or factorized in the model is often the most challenging—but also the most creative—part of model design. It is an exercise of translating our cognitive hypotheses into a mathematical form that supports Active Inference. How should we select the “right” variables? Ultimately, this is a question of specifying plausible alternatives and picking those that have the lowest free energy (cf. Bayesian model comparison). However, a practically useful perspective for most studies is that the generative model should be as similar as possible to how we believe data are generated. When appealing to Active Inference in the setting of cognitive psychology, this often means thinking about how experimental psychologists would go about generating the stimuli they present to their experimental participants. On formalizing these processes in terms of the requisite probability distributions, we arrive at a generative model whose free energy minimizing dynamics naturally lead to performance of the task in question.

Here, we can draw an analogy with most Bayesian (or ideal observer) models of perception, in which the models are designed to mimic (to a large extent) the structure of the task at hand, as in the example of recognizing a frog or an apple (chapter 2). This idea is sometimes equated with the good regulator theorem (Conant and Ashby 1970), which says that to regulate an environment effectively, a creature (whether biological or synthetic) must be a good model of that environment. From the perspective of eco-niche construction, this is sometimes phrased in terms of the (statistical) fitness (Bruineberg et al. 2018) of a creature’s model to its environment (and vice versa). However, this does not mean that an agent’s generative model has to be identical to the generative process that actually generates data. For most practical applications, it can be simplified or different. We will return to this point later in this chapter (6.6).

Box 6.2

Priors and empirical behavior

Another perspective on the issue of selecting priors draws from a set of results known as the complete class theorems (Wald 1947; Daunizeau et al. 2010), which state that any statistical decision procedure (i.e., behavior) may be framed as Bayes optimal under the right set of prior beliefs. This means that if we are interested in explaining empirical behavior, our challenge is to identify the generative model (comprising prior beliefs) that would reproduce that behavior as simply as possible. In short, priors are a statement of a hypothesis about the system in question. If several sets of prior beliefs are plausible, this offers an opportunity to put them to empirical test through Bayesian model comparison. This also has implications for computational phenotyping in clinical populations. That there will always be a set of prior beliefs that render behavior Bayes optimal implies that the key question—in understanding the computational deficits that give rise to psychiatric or neurological syndromes—is what these priors are. This idea is slightly counterintuitive at first. However, the complete class theorem means that asking whether a behavior is (Bayes) optimal is meaningless. The important question is, What are the prior beliefs that would make this behavior optimal? In chapter 9, we will see how an appeal to free energy minimization based on our own beliefs as scientists offers a way to answer this question.

6.5.2 Which Parts of the Generative Model Are Fixed, and What Is Learned?

Another design choice is deciding which parts of the generative model are fixed and which ones are updated over time as an effect of learning. In principle, Active Inference allows every part of the model—and even its structure—to be updated (or learned) over time. This renders learning a design choice rather than something mandatory. In keeping with this, we will cover examples of Active Inference models that are completely designed by hand and examples in which some parts of the model (e.g., transition probabilities) remain fixed while others (e.g., likelihoods) are updated over time.

In Active Inference, learning is cast as an aspect of inference, as a free energy minimizing process. So far, we have described inference in terms of an update of beliefs about states of the generative model. In much the same way, we can describe learning as an update of beliefs about parameters of the generative model. For this, the generative model has to be endowed with prior beliefs about parameters of the distributions to be learned, where the specific parameters depend on the probability distribution associated with each variable (e.g., mean and variance for a Gaussian distribution). These prior values are updated to form posterior beliefs whenever new data are encountered. As we will discuss in chapter 7, the algorithmic form of this update is the same as the update of state variables.
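In discrete-state implementations, a common way to realize this is to place Dirichlet priors over the columns of the likelihood mapping, whose concentration parameters act as pseudo-counts that accumulate slowly as states are inferred and outcomes observed (we return to this in chapter 7). The Python sketch below illustrates the idea; all counts, and the framing of rows as outcomes and columns as states, are illustrative.

import numpy as np

# Dirichlet pseudo-counts over a 2x2 likelihood mapping
# (rows: outcomes, e.g., red/green; columns: states, e.g., apple/frog).
a = np.ones((2, 2))          # flat prior: one pseudo-count everywhere

def expected_A(a):
    return a / a.sum(axis=0) # expected likelihood: normalize each column

# Learning (slow): after inference (fast) has settled on a state and an
# outcome has been observed, accumulate the co-occurrence as a count.
for _ in range(8):           # e.g., eight sightings of red apples
    qs = np.array([1.0, 0.0])   # inferred state: apple
    o = np.array([1.0, 0.0])    # observed outcome: red (one-hot)
    a += np.outer(o, qs)

print(expected_A(a))         # the apple column now strongly predicts red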

The fact that both inference and learning use the same kind of Bayesian belief updates may seem confusing during model design—partly because deciding what should be modeled as a state or a parameter is not always straightforward. However, when it comes to cognitive models, there is a clear difference between inference and learning. Inference describes (fast) changes of our beliefs about model states—for example, how we update our belief that there is an apple in front of us after observing something red. Learning describes (slow) changes of our beliefs about model parameters— for example, how we update our likelihood distribution to increase the value of the apples-red mapping after observing several occurrences of red apples. Beliefs about parameters typically vary much more slowly than those about states, and they may only be updated after states have been inferred. From a neurobiological perspective, it is appealing to map inference to neuronal dynamics and learning to synaptic plasticity. Furthermore, as we will discuss in chapter 7, holding probabilistic beliefs about model parameters induces novelty-seeking behaviors so that creatures may select the best data to learn the causal structure of their worlds. This suggests that endowing Active Inference models with the ability to learn their parameters (or even their structure; see chapter 7) is an effective way to study the behavioral dynamics of active learning and curiosity-based exploration.

Before concluding this section, it is worth noting that in this book we exemplify rather simple generative models that are defined using tabular methods (e.g., with explicit matrices for priors and likelihoods) and that operate in small state-spaces. In comparison, much more sophisticated kinds of generative models—and associated learning schemes—are being developed in fields like machine learning, deep learning, and robotics, such as, for example, variational autoencoders (Kingma and Welling 2014), generative adversarial networks (Goodfellow et al. 2014), recursive cortical networks (George et al. 2017), and world models (Ha and Schmidhuber 2018). In principle, one could borrow any of these methods (and many others) to implement one or more parts of Active Inference models (e.g., likelihood or transition models). By leveraging the most up-to-date machine learning methods, it would be possible to scale up Active Inference to increasingly more challenging domains and applications; see, for example, Ueltzhoffer (2018) and Millidge (2019).

However, there are some important points to consider when designing Active Inference models that use sophisticated machine learning models, especially if one is interested in cognitive and neurobiological implications. One appeal of Active Inference is that it offers an integrative perspective on cognitive functions by assuming that (for example) perceptual inference, action planning, and learning all stem from the same free energy minimization process. This integrative power would be lost if (for example) one juxtaposed generative models that operate or learn independently from one another. Furthermore, the aforementioned machine learning methods correspond to process models that are distinct from Active Inference and have different cognitive and neurobiological interpretations. Finally, when using machine learning methods, some of the design choices discussed here (e.g., about the choice of model variables) may be skipped, as they are emergent properties of learning; however, they may be replaced by different design choices, about (for example) number of layers, parameters, and learning rates of a deep neural net. These design choices potentially have relevant cognitive and neurobiological implications, which are beyond the scope of what we address here.

6.6 Setting Up the Generative Process

In Active Inference, the generative process describes the dynamics of the world external to the Active Inference agent, which corresponds to the process that determines the agent’s observations (see figure 6.1). It may seem bizarre to have postponed defining the generative process until after describing the agent’s generative model. After all, a modeler would have some task (and generative process) in mind from the beginning, so it would make perfect sense to reverse this order and design the generative process before the generative model, especially in applications where the generative model has to be learned during situated interactions, as in gamelike or robotic settings (Ueltzhoffer 2018; Millidge 2019; Sancaktar et al. 2020).

The reason we postponed the design of the generative process is that, in many practical applications discussed in this book, we simply assume that the dynamics of the generative process are the same as, or very similar to, the generative model. In other words, we generally assume that the agent’s generative model closely mimics the process that generates its observations. This is not the same as saying that the agent has perfect knowledge of the environment. Indeed, even if the agent knows the process that generates its observations, it may be uncertain about (for example) its initial state in the process, as was the case in the apple versus frog example. In the language of discrete-time Active Inference, one could design a model in which both the generative model and the generative process are characterized by the same A-matrix but in which the agent’s belief about its initial state (D-vector), which is part of its generative model, is different from—or even inconsistent with—the true initial state of the generative process. One subtle thing to notice is that even if both the generative model and the generative process are characterized by the same A- and B-matrices, their semantics are different. The A-matrix of the generative process is an objective property of the environment (sometimes called a measurement distribution in Bayesian models), whereas the A-matrix of the generative model encodes an agent’s subjective belief (called a likelihood function in Bayesian models).
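The following sketch (in Python) makes the distinction concrete: the generative process and the generative model share the same A-matrix, but the agent's D-vector is inconsistent with the true initial state, and repeated observations gradually correct the discrepancy. All values are illustrative.

import numpy as np

A = np.array([[0.8, 0.1],        # shared by process (measurement distribution)
              [0.2, 0.9]])       # and model (likelihood function)

true_state = 0                   # generative process: world starts in state 0
qs = np.array([0.1, 0.9])        # generative model's D: belief favors state 1

rng = np.random.default_rng(0)
for _ in range(10):
    # The process generates an observation from the true state ...
    o = rng.choice(2, p=A[:, true_state])
    # ... and the agent inverts its model to update its belief.
    qs = A[o] * qs
    qs /= qs.sum()

print(np.round(qs, 3))           # belief has converged toward the true state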

Of course, except in the simplest cases, it is not mandatory that the generative model and generative process are the same. In practical implementations of Active Inference, one can always specify the generative process separately from the generative model, either using equations that differ from those of the generative model or using other methods, such as game simulators, which take actions as inputs and provide observations as outputs (Cullen et al. 2018), thereby following the usual action-perception loop implied by the Markov blanket of figure 6.1.

There are some philosophical implications of designing generative models that are similar or dissimilar to the generative process (Hohwy 2013; Clark 2015; Pezzulo, Donnarumma et al. 2017; Nave et al. 2020; Tschantz et al. 2020). As discussed above, the good regulator theorem (Conant and Ashby 1970) says that an effective adaptive creature must have or be a good model of the system it regulates. However, this can be achieved in various ways. First, as discussed so far, the creature’s generative model can mimic (at least to a great extent) the generative process. Models developed in this way may be called explicit or environmental models, given the resemblance between their internal states and the environment’s external states. Second, the creature’s generative model can be much more parsimonious than (and even significantly different from) the generative process, to the extent that it correctly manages those aspects of the environment that are useful to act adaptively in it and achieve the creature’s goals. Models developed in this way may be called sensorimotor or action oriented, as they mostly encode action-observation (or sensorimotor) contingencies and their primary role is supporting goal-directed actions as opposed to providing an accurate description of the environment.

The difference between explicit and action-oriented models can be appreciated if we consider different ways one can model (for example) a rodent trying to escape from a maze in which some corridors are dead ends. An explicit generative model may resemble a cognitive map of the maze and provide a detailed characterization of external entities, such as specific locations, corridors, and dead ends. This model may permit the rodent to escape from the maze using map-based navigation. An action-oriented model may instead encode contingencies between whisker movements and touch sensations. This latter model would afford the selection of contextually appropriate strategies, such as moving forward (if no touch sensation is experienced or expected) or changing direction (in the opposite case)— eventually permitting the rodent to escape from the maze without explicitly representing locations, corridors, or dead ends. These two kinds of model prompt different philosophical interpretations of Active Inference, considering generative models as ways to either reconstruct the external environment (explicit) or afford accurate action control (action oriented).

Finally, as discussed in the field of morphological computation (Pfeifer and Bongard 2006), some aspects of a creature’s or a robot’s control can be outsourced to the body and hence do not need to be encoded in its generative model. One example is the passive dynamic walker: a physical object resembling a human body, composed of two “legs” and two “arms,” which is able to walk down an incline with no sensors, motors, or controllers (Collins et al. 2016). This example implies that at least some aspects of locomotion (or other abilities) can be achieved with body mechanics that are carefully tuned to exploit environmental contingencies (e.g., an appropriate body weight or size to walk without slipping); therefore, these contingencies do not need to be encoded in the creature’s generative model. This suggests an alternative way to design Active Inference agents (and their bodies) that are—as opposed to have—good models of their environment. Yet these different ways of designing Active Inference models are not mutually exclusive but can be appropriately combined, depending on the problem of interest.

6.7 Simulating, Visualizing, Analyzing, and Fitting Data Using Active Inference

In most practical applications, once the generative model and generative process have been defined, one only needs to use the standard procedure of Active Inference—the descent of the active and internal states on the free energy functional associated with the model—to obtain numerical results. Arguably, modelers’ goals are to simulate, visualize, analyze, and fit data (e.g., conduct model-based data analysis). Standard routines for Active Inference that provide support for all these functions are freely available (https://www.fil.ion.ucl.ac.uk/spm/); an annotated example of using these routines is provided in appendix C.

Although in most cases Active Inference procedures function off-the-shelf, in some practical applications one may consider specific fine-tunings or changes. For example, specifying the temporal depth of planning defines how many future states are considered during expected free energy computations. Setting up a limited temporal depth, along with other approximations to exhaustive search such as sampling (Fountas et al. 2020), may be useful in practical applications of Active Inference in large state-spaces.

Another example of adapting the standard functioning of Active Inference is the selective removal of parts of the expected free energy equation. This ablation may be useful to compare standard Active Inference (that uses expected free energy) with reduced versions, in which some parts of the expected free energy are suppressed to render them formally analogous to (for example) KL control or utility maximization systems (Friston, Rigoli et al. 2015). Furthermore, one can also augment Active Inference models with additional mechanisms, such as habitual learning (Friston, FitzGerald et al. 2016) or learning rate modulation (Sales et al. 2019), with the caveat that maintaining the normative character of Active Inference would require casting these additional mechanisms in terms of free energy minimization.

Finally, other fine-tunings or changes to Active Inference may be useful to characterize disorders of inference and psychopathological conditions— for example, to explore the behavioral and neuronal consequences of endowing a creature’s generative model with excessively strong (or weak) priors via excessively high (or low) levels of neuromodulators. We will provide some examples of Active Inference models that are relevant for psychopathology in chapter 9.

6.8 Summary

In this chapter, we have outlined the most important design choices that must be made in setting up an Active Inference model. We provided a recipe in four steps and some guidelines to address the usual challenges that model designers face. Of course, it is not necessary to follow the recipe in a rigid manner. Some steps can be inverted (e.g., design the generative process before the generative model) or combined. But in general, these steps are all required. This sets up the remainder of this book, which puts these ideas into practice through a series of illustrative examples designed to showcase the theoretical principles presented in the first half of the book. In everything that follows, the only differences among the examples rest on the design choices we have highlighted here. Part 2 illustrates systems with different boundaries, with discrete or continuous dynamics at different timescales, for which the choice of prior beliefs is fundamental in reproducing behavior across many different domains—but all implementing the same Active Inference.

10. Active Inference as a Unified Theory of Sentient Behavior

In general we are least aware of what our minds do best.

—Marvin Minsky

10.1 Introduction

In this chapter, we wrap up Active Inference’s main theoretical points (from the first part of the book) and its practical implementations (from the second part). Then, we connect the dots: we abstract away from the specific Active Inference models discussed in previous chapters to focus on integrative aspects of the framework. One benefit of Active Inference is that it provides a complete solution to the adaptive problems that sentient organisms have to solve. It therefore offers a unified perspective on problems like perception, action selection, attention, and emotion regulation, which are usually treated in isolation in psychology and neuroscience—and addressed using distinct computational approaches in artificial intelligence. We will discuss the Active Inference perspective on each of these problems (and more) in the context of established theories, such as cybernetics, ideomotor theory of action, reinforcement learning, and optimal control. Finally, we briefly discuss how the scope of Active Inference can be extended to cover other biological, social, and technological topics that are not discussed in depth in this book.

10.2 Wrapping Up

This book offers a systematic account of the theoretical underpinnings and practical implementations of Active Inference. Here, we briefly summarize the discussion of the first nine chapters. This offers an opportunity to rehearse the key constructs of Active Inference that will be useful in the remainder of this chapter.

In chapter 1, we introduced Active Inference as a normative approach to understanding sentient creatures that form part of action-perception loops with their environment (Fuster 2004). We explained that normative approaches start from first principles to derive and test empirical predictions about the phenomenon of interest—here, the ways living organisms persist while engaging in adaptive exchanges (action-perception loops) with their environment. We also considered that one could arrive at Active Inference by following a low road or a high road.

In chapter 2, we illustrated the low road to Active Inference. This road starts from the idea that the brain is a prediction machine, endowed with a generative model: a probabilistic representation of how hidden causes in the world generate sensations (e.g., how light reflected off an apple stimulates the retina). By inverting this model, it infers the causes of its sensations (e.g., whether I am seeing an apple, given that my retina is stimulated in a certain way). This view of perception (aka perception-as-inference) has its historical roots in the Helmholtzian notion of unconscious inference and, more recently, in the Bayesian brain hypothesis. Active Inference extends this view by bringing action control and planning within the compass of inference (aka control-as-inference, planning-as-inference). Most importantly, it shows that perception and action are not quintessentially separable processes but fulfill the same objective. We first described this objective more informally, as the minimization of a discrepancy between one’s model and the world (which generally reduces to surprise or prediction error minimization). Put simply, one can minimize the discrepancy between a model and the world in two ways: by changing one’s mind to fit the world (perception) or by changing the world to fit the model (action). These can be described in terms of Bayesian inference. However, exact inference is often intractable, so Active Inference uses a (variational) approximation (noticing that exact inference may be seen as a special case of approximate inference). This leads to the second, more formal description of the common objective of perception and action, as variational free energy minimization. This is the core quantity used in Active Inference and may be unpacked in terms of its constituent parts (e.g., energy and entropy, complexity and accuracy, or surprise and divergence). Finally, we introduced a second kind of free energy: expected free energy. This is particularly important during planning, as it affords a way to score alternative policies by considering the future outcome that they are expected to generate. This too may be unpacked in terms of its constituent parts (e.g., information gain and pragmatic value, expected ambiguity and risk).

In chapter 3, we illustrated the high road to Active Inference. This alternative road starts from the deflationary imperative for biological organisms to preserve their integrity and avoid dissipation, which can be described as avoiding surprising states. We then introduced the notion of a Markov blanket: a formalization of the statistical separation between the organism’s internal states and the world’s external states. Crucially, internal and external states can only influence each other vicariously via intermediate (active and sensory) variables, called blanket states. This statistical separation—mediated by the Markov blanket—is crucial to endowing an organism with some degree of autonomy from the external world. To understand why this is a useful perspective, consider the following three consequences.

First, an organism with a Markov blanket appears to model the external environment in a Bayesian sense: its internal states correspond—on average—to an approximate posterior belief about external states of the world. Second, the autonomy is guaranteed by the fact that the organism’s model (its internal states) is not unbiased but prescribes some existential preconditions (or prior preferences) that must be maintained—for example, for a fish, being in the water. Third, equipped with this formalism, it is possible to describe optimal behavior (with respect to prior preferences) as the maximization of (Bayesian) model evidence by perception and action. By maximizing model evidence (i.e., self-evidencing) an organism ensures that it realizes its prior preferences (e.g., a fish stays in the water) and avoids surprising states. In turn, the maximization of model evidence is (approximately) mathematically equivalent to the minimization of variational free energy—hence we arrive again (in another way) at the same central construct of Active Inference discussed in chapter 2. Finally, we detailed the relationship between minimizing surprise and Hamilton’s principle of least action. This evinces the formal relationship between Active Inference and first principles in statistical physics.

In chapter 4, we outlined the formal aspects of Active Inference. We focused on the passage from Bayesian inference to a tractable approximation—variational inference—and the resulting objective for organisms to minimize variational free energy via perception and action. The insight from this treatment is the importance of the generative model that creatures use to make sense of their world. We introduced two kinds of generative models that express our beliefs about how data are generated, using discrete or continuous variables. We explained that both afford the same Active Inference, but they apply when states of affairs are formulated in discrete time (as partially observed Markov decision problems) or continuous time (as stochastic differential equations), respectively.

In chapter 5, we remarked on the difference between the normative principle of free energy minimization and a process theory about how this principle may be implemented by the brain—and explained that the latter generates testable predictions. We then outlined aspects of the process theories accompanying Active Inference, which encompass domains such as neuronal message passing, including neuroanatomical circuitry (e.g., cortico-subcortical loops) and neuromodulation. For example, at an anatomical level, message passing maps nicely to a canonical cortical microcircuit, with predictions that stem from deep cortical layers at one level and target superficial cortical layers at the level below (Bastos et al. 2012). At a more systemic level, we discussed how Bayesian inference, learning, and precision weighting correspond to neuronal dynamics, synaptic plasticity, and neuromodulation, respectively, and how the top-down and bottom-up neural message passing of predictive coding maps to slower (e.g., alpha or beta) and faster (e.g., gamma) brain rhythms. These and other examples illustrate that after designing a specific Active Inference model, one can draw neurobiological implications from the form of its generative model.

In chapter 6, we provided a recipe to design Active Inference models. We saw that while all creatures minimize their variational free energy, they behave in different, sometimes opposite ways because they are endowed with different generative models. Therefore, what distinguishes different (e.g., simpler from more complex) creatures is just their generative model. There is a rich repertoire of possible generative models, which correspond to different biological (e.g., neuronal) implementations and produce different adaptive—or maladaptive—behaviors in different contexts and ecological niches. This renders Active Inference equally appropriate for characterizing simple creatures like bacteria that sense and seek nutrient gradients, complex creatures like us that pursue sophisticated goals and engage in rich cultural practices, or even different individuals—to the extent that one appropriately characterizes their respective generative models. Evolution appears to have discovered increasingly sophisticated design structures for brains and bodies that made organisms able to deal with (and shape) rich ecological niches. Modelers can reverse-engineer this process and specify the designs for brains and bodies of creatures of interest, in terms of generative models, based on the kinds of niche they occupy. This corresponds to a series of design choices (e.g., models using discrete or continuous variables, shallow or hierarchical models)—which we unpacked in the chapter.

In chapters 7 and 8, we provided numerous examples of Active Inference models in discrete and continuous time, which address problems of perceptual inference, goal-directed navigation, model learning, action control, and more. These examples were designed to showcase the variety of emergent behaviors under these models and to detail the principles of how they are specified practically.

In chapter 9, we discussed how to use Active Inference for model-based data analysis and to recover the parameters of an individual’s generative model that best explain the subject’s behavior in a task. This computational phenotyping uses the same form of Bayesian inference discussed in the rest of the book, but in a different way: it helps design and evaluate (objective) models of others’ (subjective) models.

10.3 Connecting the Dots: The Integrative Perspective of Active Inference

Some decades ago, the philosopher Dennett lamented that cognitive scientists devoted too much effort to modeling isolated subsystems (e.g., perception, language understanding) whose boundaries are often arbitrary. He suggested instead trying to model “the whole iguana”: a complete cognitive creature (perhaps a simple one) and an environmental niche for it to cope with (Dennett 1978).

One benefit of Active Inference is that it offers a first principle account of the ways in which organisms solve their adaptive problems. The normative approach pursued in this book assumes that it is possible to start from the principle of variational free energy minimization and derive implications about specific cognitive processes, such as perception, action selection, attention and emotion regulation, and their neuronal underpinnings.

Imagine a simple creature that must solve problems like finding food or shelter. When cast as Active Inference, the creature's problems can be described in enactive terms, as acting to solicit preferred sensations (e.g., food-related sensations). To the extent that these preferred sensations are included (as prior beliefs) in its generative model, the organism is effectively gathering evidence for its model—or, more allegorically, for its existence (i.e., maximizing model evidence or self-evidencing). This simple principle has ramifications for psychological functions traditionally considered in isolation, such as perception, action control, memory, attention, intention, emotion, and more. For example, perception and action are both self-evidencing, in the sense that a creature can align what it expects, given its generative model, with what it senses either by changing its beliefs (about the presence of food) or by changing the world (soliciting food-related sensations). Memory and attention can also be thought of as optimizing the same objective. Long-term memory develops through learning the parameters of a generative model. Working memory is belief updating when beliefs are about external states in the past and future. Attention is the optimization of beliefs about the precision of sensory input. Forms of planning (and intentionality) can be conceptualized by appealing to the capacity of (some) creatures to select among alternative futures, which in turn requires temporally deep generative models. These predict the outcomes that would result from a course of action and are optimistic about these outcomes. This optimism manifests as the belief that one's course of action will lead to preferred outcomes. Deep temporal models can also help us understand sophisticated forms of prospection (where beliefs about the present are used to derive beliefs about the future) and retrospection (where beliefs about the present are used to update beliefs about the past). Forms of interoceptive regulation and emotion can be conceptualized by appealing to generative models of internal physiology that predict the allostatic consequences of future events.

As the above examples illustrate, there is an important consequence of studying cognition and behavior from the perspective of a normative theory of sentient behavior. Such a theory does not start by assembling separate cognitive functions, such as perception, decision-making, and planning. Rather, it starts by providing a complete solution to the problems that organisms have to solve and then analyzes the solution to derive implications about cognitive functions. For example, which mechanisms permit a living organism or artificial creature (e.g., a robot) to perceive the world, remember it, or plan (Verschure et al. 2003, 2014; Verschure 2012; Pezzulo, Barsalou et al. 2013; Krakauer et al. 2017)? This is an important move, as the taxonomies of cognitive functions—used in psychology and neuroscience textbooks—largely inherit from early philosophical and psychological theories (sometimes called Jamesian categories). Despite their great heuristic value, they may be quite arbitrary—or they may not correspond to separate cognitive and neural processes (Pezzulo and Cisek 2016, Buzsaki 2019, Cisek 2019). Indeed, these Jamesian categories may be candidates for how our generative models explain our engagement with the sensorium—as opposed to explaining that engagement. For example, the solipsistic hypothesis that "I am perceiving" is just my explanation for current states of affairs that include my belief updating.

Adopting a normative perspective may also help in identifying formal analogies between cognitive phenomena studied in different domains. One example is the trade-off between exploration and exploitation, which appears in various guises (Hills et al. 2015). This trade-off is often studied during foraging, when creatures must choose between exploiting previous successful plans and exploring novel (potentially better) ones. However, the same trade-off occurs during memory search and deliberation with limited resources (e.g., time limitations or search effort), when creatures have the choice between exploiting their current best plan versus investing more time and cognitive effort to explore additional possibilities. Characterizing these apparently disconnected phenomena in terms of free energy can potentially reveal deep similarities (Friston, Rigoli et al. 2015; Pezzulo, Cartoni et al. 2016; Gottwald and Braun 2020).

Finally, in addition to a unified perspective on psychological phenomena, Active Inference offers a principled means of understanding the corresponding neural computations. In other words, it offers a process theory that connects cognitive processing to (expected) neuronal dynamics. Active Inference assumes that everything that matters about brains, minds, and behavior can be described in terms of the minimization of variational free energy. In turn, this minimization has specific neural signatures (in terms of, e.g., message passing or brain anatomy) that can be empirically validated.

In the rest of this chapter, we explore some implications of Active Inference for psychological functions—as if we were sketching a psychology textbook. For each of these functions, we also highlight some points of contact (or divergence) between Active Inference and other popular theories in the literature.

10.4 Predictive Brains, Predictive Minds, and Predictive Processing

I have this picture of pure joy

it’s of a child with a gun

he’s aiming straight in front of himself, shooting at something that isn’t there.

—Afterhours, "Quello che non c'è" (Something that isn't there)

Traditional theories of brain and cognition emphasize feedforward transductions from external stimuli to internal representations and then motor actions. This has been called a “sandwich model,” as everything that is in between stimuli and responses is assigned the label “cognitive” (Hurley 2008). In this perspective, the main function of the brain is to transform incoming stimuli into contextually appropriate responses.

Active Inference departs significantly from this view by emphasizing predictive and goal-directed aspects of brain and cognition. In psychological terms, Active Inference creatures (or their brains) are probabilistic inference machines, which continuously generate predictions based on their generative models.

Self-evidencing creatures use their predictions in two fundamental ways. First, they compare predictions with incoming data to validate their hypotheses (predictive coding) and—at a slower timescale—revise their models (learning). Second, they enact predictions to guide the ways they gather data (Active Inference). By doing so, Active Inference creatures fulfill two imperatives: epistemic (e.g., visually exploring places where salient information is present that can resolve uncertainty about hypotheses or models) and pragmatic (e.g., moving to locations where preferred observations such as rewards can be secured). The epistemic imperative renders both perception and learning active processes, whereas the pragmatic imperative renders behavior goal directed.

10.4.1 Predictive Processing

This predictive- and goal-centric view of brain—and cognition—is closely related to (and provided inspiration for) predictive processing (PP): an emerging framework in philosophy of mind and epistemology, which sees prediction as central to brain and cognition and appeals to concepts of “predictive brains” or “predictive minds” (Clark 2013, 2015; Hohwy 2013).

Sometimes PP theories appeal to the specific functioning of Active Inference and some of its constructs, such as generative models, predictive coding, free energy, precision control, and Markov blankets, but they sometimes appeal to other constructs, such as coupled inverse and forward models, which are not part of Active Inference. Therefore, the term predictive processing is used in a broader (and less constrained) sense compared to Active Inference.

Predictive processing theories have attracted considerable attention in philosophy, given their potential for unification in many senses: across multiple domains of cognition, including perception, action, learning, and psychopathology; from lower (e.g., sensorimotor) to higher levels of cognitive processing (e.g., psychological constructs); from simple biological organisms to brains, individuals, and social and cultural constructs. Another appeal of PP theories is that they make use of conceptual terms, such as beliefs and surprise, which speak to a psychological level of analysis familiar to philosophers (with the caveat that sometimes these terms may have technical meanings that differ from common usage).

Yet, as the interest in PP grows, it has become apparent that philosophers have different opinions on its theoretical and epistemological implications. For example, it has been interpreted in internalist (Hohwy 2013), embodied or action-based (Clark 2015), and enactivist and nonrepresentational terms (Bruineberg et al. 2016, Ramstead et al. 2019). The debate around these conceptual interpretations goes beyond the scope of this book.

10.5 Perception

You can’t depend on your eyes when your imagination is out of focus.

—Mark Twain

Active Inference considers perception as an inferential process based on a generative model of how sensory observations are generated. Bayes' rule essentially inverts the model to compute a belief about the hidden state of the environment, given the observations. This idea of perception-as-inference dates back to Helmholtz (1866) and has often been revisited in psychology, computational neuroscience, and machine learning (e.g., analysis-by-synthesis) (Gregory 1980, Dayan et al. 1995, Mesulam 1998, Yuille and Kersten 2006). This generative modeling approach has proven effective on challenging perceptual problems, such as breaking text-based CAPTCHAs (George et al. 2017).
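
To make this inversion concrete, here is a minimal sketch of perception-as-inference over a single discrete hidden state, using nothing beyond Bayes' rule. The states, likelihood, and prior below are invented for illustration and are not drawn from any model in this book.

```python
import numpy as np

# Hypothetical generative model: P(observation | hidden state) and a prior P(state).
# States and probabilities are illustrative only.
states = ["food present", "food absent"]
likelihood = np.array([[0.8, 0.1],   # P(o = "odor"    | state)
                       [0.2, 0.9]])  # P(o = "no odor" | state)
prior = np.array([0.5, 0.5])

def infer(observation_index, prior):
    """Invert the model with Bayes' rule: posterior is proportional to likelihood times prior."""
    unnormalized = likelihood[observation_index] * prior
    return unnormalized / unnormalized.sum()

posterior = infer(0, prior)  # observe "odor"
print(dict(zip(states, posterior.round(3))))  # belief shifts toward "food present"
```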

10.5.1 Bayesian Brain Hypothesis

The most prominent contemporary expression of this idea is the Bayesian brain hypothesis, which has been applied to several domains such as decision-making, sensory processing, and learning (Doya 2007). Active Inference provides a normative foundation to these inferential ideas by deriving them from the imperative of minimizing variational free energy. As the same imperative extends to action dynamics, Active Inference naturally models active perception and the ways in which organisms actively sample observations to test their hypotheses (Gregory 1980). Under the Bayesian brain agenda, instead, perception and action are modeled in terms of different imperatives (where action requires Bayesian decision theory; see section 10.7.1).

More broadly, the Bayesian brain hypothesis refers to a family of approaches that are not necessarily integrated and often make different empirical predictions. These include, for example, the computational-level proposal that the brain performs Bayes-optimal sensorimotor and multisensory integration (Kording and Wolpert 2006), the algorithmic-level proposal that the brain implements specific approximations of Bayesian inference, such as decision-by-sampling (Stewart et al. 2006), and the neural-level proposals about the specific ways in which neural populations may perform probabilistic computations or encode probability distributions—for example, as samples or probabilistic population codes (Fiser et al. 2010, Pouget et al. 2013). At each level of explanation, there are competing theories in the field. For example, it is common to appeal to approximations of exact Bayesian inference to explain deviations from optimal behavior, but different works consider different (and not always compatible) approximations, such as different sampling approaches. More broadly, the relations between proposals at different levels are not always straightforward. This is because Bayesian computations can be realized (or approximated) in multiple algorithmic ways, even without explicitly representing probability distributions (Aitchison and Lengyel 2017).

Active Inference provides a more integrated perspective that connects normative principles and process theories. At the normative level, its central assumption is that all processes minimize variational free energy. The corresponding process theory for inference uses a gradient descent on free energy, which has clear neurophysiological implications, explored in chapter 5 (Friston, FitzGerald et al. 2016). More broadly, one can start from the principle of free energy minimization to derive implications about brain architectures.
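
As a toy illustration of this process theory, the following sketch performs a gradient descent on variational free energy for a single Gaussian belief, under the same simplifying (Laplace-style) assumptions that give rise to predictive coding, discussed next. The likelihood mapping, precisions, and learning rate are illustrative assumptions.

```python
import numpy as np

# Minimal one-level sketch under Gaussian assumptions.
# Up to constants, F = pi_o*(o - g(mu))^2/2 + pi_p*(mu - m)^2/2,
# and belief updating is a gradient descent on F. All numbers are illustrative.
def simulate(o=2.0, m=0.0, pi_o=1.0, pi_p=1.0, lr=0.1, steps=50):
    g = lambda mu: mu          # identity likelihood mapping, for simplicity
    mu = m                     # initialize the belief at the prior mean
    for _ in range(steps):
        eps_o = o - g(mu)      # sensory prediction error (bottom-up)
        eps_p = mu - m         # prior prediction error (top-down)
        dF_dmu = -pi_o * eps_o + pi_p * eps_p
        mu -= lr * dF_dmu      # descend the free energy gradient
    return mu

print(simulate())  # settles between prior and observation (~1.0 for equal precisions)
```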

For example, the canonical process model of perceptual inference (in continuous time) is predictive coding. Predictive coding was initially proposed as a theory of hierarchical perceptual processing by Rao and Ballard (1999) to explain a range of documented top-down effects, which were difficult to reconcile with feedforward architectures, as well as known physiological facts (e.g., the existence of forward, or bottom-up, and backward, or top-down, connections in sensory hierarchies). However, predictive coding can be derived from the principle of free energy minimization, under some assumptions, such as the Laplace approximation (Friston 2005). Furthermore, Active Inference in continuous time can be constructed as a direct extension of predictive coding into the domain of action—by endowing a predictive coding agent with motor reflexes (Shipp et al. 2013). This leads us to the next point.

10.6 Action Control

If you can’t fly then run, if you can’t run then walk, if you can’t walk then crawl, but whatever you do you have to keep moving forward.

—Martin Luther King

In Active Inference, action processing is analogous to perceptual processing, as both are guided by forward predictions—exteroceptive and proprioceptive, respectively. It is the (proprioceptive) prediction that “my hand grasps the cup” that induces a grasping movement. The equivalence between action and perception exists also at the neurobiological level: the architecture of the motor cortex is organized in the same way as the sensory cortex—as a predictive coding architecture, with the exceptions that it can influence motor reflexes in the brain stem and spine (Shipp et al. 2013) and that it receives relatively little ascending input. Motor reflexes permit controlling movement by setting “equilibrium points” along a desired trajectory—an idea that corresponds to the equilibrium point hypothesis (Feldman 2009).

Importantly, initiating an action—like grasping a cup—requires appropriately regulating the precision (inverse variance) of prior beliefs and sensory streams. This is because the relative values of these precisions determine the way in which a creature manages the conflict between its prior belief (that it holds the cup) and its sensory input (signaling that it does not). An imprecise prior belief about grasping a cup can be easily revised in the light of conflicting sensory evidence—producing a change of mind and no action. By contrast, when the prior belief dominates (i.e., has higher precision), it is maintained even in the face of conflicting sensory evidence—and it induces a grasping action to resolve the conflict. To ensure that this is the case, action initiation induces a transient sensory attenuation (i.e., a downweighting of sensory prediction errors). Failure of this sensory attenuation can have maladaptive consequences, such as the failure to initiate or control movements (Brown et al. 2013).
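
The role of relative precision can be caricatured in a few lines. In this sketch, the posterior is a precision-weighted average of the prior and the sensory evidence; attenuating sensory precision lets the prior (and hence action) win, whereas an imprecise prior yields a change of mind. All values are illustrative.

```python
import numpy as np

# Posterior mean under Gaussian assumptions: a precision-weighted average of
# the prior ("my hand grasps the cup") and the senses ("it does not").
def posterior_mean(prior_mu, sensory_o, pi_prior, pi_sensory):
    return (pi_prior * prior_mu + pi_sensory * sensory_o) / (pi_prior + pi_sensory)

prior_mu, sensory_o = 1.0, 0.0   # 1 = grasping, 0 = not grasping (illustrative coding)

# Imprecise prior: belief collapses toward the senses -> change of mind, no action.
print(posterior_mean(prior_mu, sensory_o, pi_prior=0.1, pi_sensory=1.0))  # ~0.09

# Sensory attenuation at action onset: the prior dominates, leaving a
# proprioceptive prediction error for peripheral reflexes to resolve by moving.
print(posterior_mean(prior_mu, sensory_o, pi_prior=1.0, pi_sensory=0.1))  # ~0.91
```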

10.6.1 Ideomotor Theory

In Active Inference, action stems from (proprioceptive) predictions and not motor commands (Adams, Shipp, and Friston 2013). This idea connects Active Inference to the ideomotor theory of action: a framework for understanding action control that dates back to William James (1890) and the later theories of "event coding" and "anticipatory behavioural control" (Hommel et al. 2001, Hoffmann 2003). Ideomotor theory suggests that action-effect links (similar to forward models) are key mechanisms in the architecture of cognition. Importantly, these links can be used bidirectionally. When they are used in the action-effect direction, they permit generating sensory predictions; when they are used in the effect-action direction, they permit selecting actions that achieve desired perceptual consequences—implying that actions are selected and controlled on the basis of their predicted consequences (hence the term ideo+motor). This anticipatory view of action control is supported by a body of literature that documents the effects of (anticipated) action consequences on action selection and execution (Kunde et al. 2004). Active Inference provides a mathematical characterization of this idea that also includes additional mechanisms, such as the importance of precision control and sensory attenuation, which are not fully investigated in (but are compatible with) ideomotor theory.

10.6.2 Cybernetics

Active Inference is closely related to cybernetic ideas about the purposeful, goal-directed nature of behavior and the importance of (feedback-based) agent-environment interactions, as exemplified by the TOTE (Test, Operate, Test, Exit) and related models (Miller et al. 1960; Pezzulo, Baldassarre et al. 2006). In both TOTE and Active Inference, the selection of actions is determined by the discrepancy between a preferred (goal) state and the current state. These approaches diverge from simple stimulus-response relationships, as more commonly assumed in behaviorist theory and computational frameworks like reinforcement learning (Sutton and Barto 1998).

The notion of action control in Active Inference is particularly akin to perceptual control theory (Powers 1973). Central to perceptual control theory was the notion that what is controlled is a perceptual state, not a motor output or action. For example, while driving, what we control—and keep stable over time in the face of disturbances—is our reference or desired velocity (e.g., 90 mph), as signaled by the speedometer, whereas the actions we select for this (e.g., accelerating or decelerating) are more variable and context dependent. For example, depending on the disturbance (e.g., wind, a steep road, or other cars), we would need to either accelerate or decelerate to maintain the reference velocity. This view implements William James’s (1890) suggestion that “humans achieve stable goals via flexible means.”

While in both Active Inference and perceptual control theory it is a perceptual (and specifically a proprioceptive) prediction that controls action, the two theories differ in how control is implemented. In Active Inference but not perceptual control theory, action control has anticipatory or feedforward aspects, based on generative models. In contrast, perceptual control theory assumes that feedback mechanisms are largely sufficient to control behavior, whereas trying to predict a disturbance, or exerting feedforward (or open-loop) control, is worthless. However, this objection was mainly intended to address the limitations of control theories that use inverse-forward models (see next section). Under Active Inference, generative or forward models are not used to predict a disturbance but to predict future (desired) states and trajectories to be fulfilled by acting—and to infer the latent cause of perceptual events.

Finally, another important point of contact between Active Inference and perceptual control theory is the way they conceptualize control hierarchies. Perceptual control theory proposes that higher hierarchical levels control lower hierarchical levels by setting their reference points or set-points (i.e., what they have to achieve) while leaving them free to select the means to achieve them, rather than by setting or biasing the actions that the lower levels have to perform (i.e., how to operate). This stands in contrast with most theories of hierarchical and top-down control, in which higher levels either directly select plans (Botvinick 2008) or bias the selection of actions or motor commands at lower hierarchical levels (Miller and Cohen 2001). Similar to perceptual control theory, in Active Inference one can decompose hierarchical control in terms of a (top-down) cascade of goals and subgoals, which can be autonomously achieved at the appropriate (lower) levels. Furthermore, in Active Inference, the contribution of goals represented at different levels of the control hierarchy can be modulated (precision weighted) by motivational processes, in such a way that the more salient or urgent goals are prioritized (Pezzulo, Rigoli, and Friston 2015, 2018).

10.6.3 Optimal Control Theory

The way Active Inference accounts for action control is significantly different from other models of control in neuroscience, such as optimal control theory (Todorov 2004, Shadmehr et al. 2010). This framework assumes that the brain’s motor cortex selects actions using a (reactive) control policy that maps stimuli to responses. Active Inference, instead, assumes that the motor cortex conveys predictions, not commands.

Furthermore, while both optimal control theory and Active Inference appeal to internal models, they describe internal modeling in different ways (Friston 2011). In optimal control, there is a distinction between two kinds of internal models: inverse models encode stimulus-response contingencies and select motor commands (according to some cost function), whereas forward models encode action-outcome contingencies and provide inverse models with simulated inputs to replace noisy or delayed feedback, hence going beyond a pure feedback control scheme. Inverse and forward models can also operate in a loop that is detached from external action-perception (i.e., when inputs and outputs are suppressed) to support internal, "what if" simulations of action sequences. Such internal simulations of action have been linked to various cognitive functions, such as planning, action perception, and imitation in social domains (Jeannerod 2001, Wolpert et al. 2003) as well as various disorders of movement and psychopathologies (Frith et al. 2000).

In contrast to the forward-inverse modeling scheme, in Active Inference forward (generative) models do the heavy lifting of action control, whereas inverse models are minimalistic and often reduce to simple reflexes resolved at the peripheral level (i.e., in the brain stem or spinal cord). Action is initiated when there is a difference between anticipated and observed states (e.g., desired and current arm positions)—that is, a sensory prediction error. This means a motor command is equivalent to a prediction made by the forward model, as opposed to something computed by an inverse model as in optimal control. The sensory (more precisely, proprioceptive) prediction error is resolved by an action (i.e., an arm movement). The gap to be filled by action is considered so small that it does not require a sophisticated inverse model but a much simpler motor reflex (Adams, Shipp, and Friston 2013). What renders a motor reflex simpler than an inverse model is that it does not encode a mapping from inferred states of the world to action but a much simpler mapping between action and sensory consequences. See Friston, Daunizeau et al. (2010) for further discussion.

Another crucial difference between optimal motor control and Active Inference is that the former uses a notion of cost or value function to motivate action, whereas the latter replaces it with the Bayesian notion of prior (or prior preference, implicit in expected free energy)—as we discuss in the next section.

10.7 Utility and Decision-Making

Action expresses priorities.

—Mahatma Gandhi

The notion of a cost or value function of states is central in many fields, such as optimal motor control, economic theories of utility maximization, and reinforcement learning. For example, in optimal control theory, the optimal control policy for a reaching task is often defined as the one that minimizes a specific cost function (e.g., one favoring smoother or minimum-jerk movements). In reinforcement learning problems, such as navigating in a maze that includes one or more rewards, the optimal policy is the one that permits maximizing (discounted) reward while also minimizing movement costs. These problems are often solved using the Bellman equation (or the Hamilton-Jacobi-Bellman equation in continuous time), whose general idea is that the value of a decision can be decomposed into two parts: the immediate reward and the value of the remaining part of the decision problem. This decomposition affords the iterative procedure of dynamic programming, which is at the core of control theory and reinforcement learning (RL) (Bellman 1954).
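
For readers unfamiliar with this decomposition, the sketch below runs value iteration, the simplest dynamic programming scheme, on an invented three-state problem. Everything here (states, rewards, transitions) is a toy assumption, included only to exhibit the recursive structure of the Bellman equation.

```python
import numpy as np

# Tiny value-iteration sketch of the Bellman decomposition:
# V(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
# The three-state chain, rewards, and transitions are invented for illustration.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))
P[0] = np.eye(3)                               # action 0: stay put
P[1] = np.roll(np.eye(3), 1, axis=1)           # action 1: move right (wraps around)
r = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])  # staying in state 2 pays off

V = np.zeros(n_states)
for _ in range(100):
    Q = r.T + gamma * P @ V                    # Q[a, s]: immediate reward + value-to-go
    V = Q.max(axis=0)
print(V.round(2))                              # value propagates backward along the chain
```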

Active Inference differs from the above approach in two main ways. First, Active Inference does not consider utility maximization alone but the broader objective of (expected) free energy minimization, which also includes additional (epistemic) imperatives, such as the disambiguation of current state and novelty seeking (see figure 2.5). These additional objectives are sometimes added on to classical rewards—for example, as a "novelty bonus" (Kakade and Dayan 2002) or "intrinsic reward" (Schmidhuber 1991, Oudeyer et al. 2007, Baldassarre and Mirolli 2013, Gottlieb et al. 2013)—but they arise automatically in Active Inference, enabling it to resolve exploration-exploitation trade-offs implicit in many decisions. The reason for this is that free energies are functionals of beliefs, which means we are in the realm of belief optimization as opposed to external reward functions. This is essential in explorative problems, wherein success depends on resolving as much uncertainty as possible.
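
To make this explicit, the sketch below scores policies by an expected free energy that combines a pragmatic term (expected log preferences over outcomes) with an epistemic term (expected information gain), in the spirit of the decomposition discussed in chapter 2. The likelihood matrix, preferences, and predicted state distributions are toy assumptions.

```python
import numpy as np

# Expected free energy of a policy, sketched as
#   G(pi) = -(pragmatic value + epistemic value)
# where pragmatic value = E_Q[ln P(o)] and epistemic value is the expected
# information gain about hidden states. All distributions are toy assumptions.
def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_free_energy(Q_states, A, log_C):
    """Q_states: predicted states under the policy; A: likelihood P(o|s),
    columns indexed by state; log_C: log preferences over outcomes."""
    Q_outcomes = A @ Q_states
    pragmatic = Q_outcomes @ log_C                       # are outcomes preferred?
    ambiguity = Q_states @ np.array([entropy(A[:, s]) for s in range(A.shape[1])])
    epistemic = entropy(Q_outcomes) - ambiguity          # mutual information
    return -(pragmatic + epistemic)

A = np.array([[0.9, 0.1],            # a fairly unambiguous state-outcome mapping
              [0.1, 0.9]])
log_C = np.log(np.array([0.75, 0.25]))                   # prefer outcome 0
G_explore = expected_free_energy(np.array([0.5, 0.5]), A, log_C)
G_exploit = expected_free_energy(np.array([0.9, 0.1]), A, log_C)
print(G_explore, G_exploit)  # each policy is scored on both imperatives at once
```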

Second, in Active Inference, the notion of cost is absorbed into the prior. The prior (or prior preference) specifies an objective for control—for example, a trajectory to follow or an endpoint to reach. Using priors to encode preferred observations (or sequences) may be more expressive than using utilities (Friston, Daunizeau, and Kiebel 2009). Using this method, finding the optimal policy is recast as a problem of inference (of a sequence of control states that realize the preferred trajectory) and does not require a value function or the Bellman equation—although it can appeal to a similar recursive logic (Friston, Da Costa et al. 2020). There are at least two fundamental differences between the ways priors and value functions are normally used in Active Inference and RL, respectively. First, RL methods use value functions of states or of state-action pairs—whereas Active Inference uses priors over observations. Second, value functions are defined in terms of the expected return of being in a state (or performing an action in a state) following a specific policy—that is, the sum of future (discounted) rewards obtained by starting in the state and then executing the policy. In contrast, in Active Inference, priors do not usually sum future rewards, nor do they discount them. Rather, something analogous to the expected return only emerges in Active Inference when the expected free energy of a policy is calculated. The implication is that expected free energy is the closest analogue to the value function. However, even this differs in the sense that expected free energy is a functional of beliefs about states, not a function of states. Having said this, it is possible to construct priors that resemble value functions of states in RL—for example, by caching expected free energy calculations in these states (Friston, FitzGerald et al. 2016; Maisto, Friston, and Pezzulo 2019).

Furthermore, absorbing the notion of utility into the prior has an important theoretical consequence: priors play the role of goals and render the generative model biased—or optimistic, in the sense that the creature believes it will encounter preferred outcomes. It is this optimism that underwrites inferred plans that achieve desired outcomes in Active Inference; a failure of this sort of optimism may correspond to apathy (Hezemans et al. 2020). This stands in contrast with other formal approaches to decision-making, such as Bayesian decision theory, which separate the probability of events from their utility. Having said this, this distinction is somewhat superficial, as a utility function can always be rewritten as encoding a prior belief, consistent with the fact that behaviors that maximize a utility function are a priori (and by design) more probable. From one (slightly tautological) deflationary perspective, this is the definition of utility.

10.7.1 Bayesian Decision Theory

Bayesian decision theory is a mathematical framework that extends the ideas of the Bayesian brain (discussed above) to the domains of decision-making, sensorimotor control, and learning (Kording and Wolpert 2006, Shadmehr et al. 2010, Wolpert and Landy 2012). Bayesian decision theory describes decision-making in terms of two distinct processes. The first process uses Bayesian computations to predict the probability of future (action- or policy-dependent) outcomes, and the second process defines the preference over plans, using a (fixed or learned) utility or cost function. The final decision (or action selection) process integrates both streams, selecting (with higher probability) the action plan most likely to yield the greatest reward. This stands in contrast to Active Inference, in which the prior distribution directly signals what is valuable for the organism (or what has been valuable during evolutionary history). However, parallels could be drawn between the two streams of Bayesian decision theory and the optimization of variational and expected free energy, respectively. Under Active Inference, the minimization of variational free energy affords accurate (and simple) beliefs about the state of the world and its likely evolution. The prior belief that expected free energy will be minimized through policy selection incorporates the notion of preferences.

In some circles, there are concerns about the status of Bayesian decision theory. This follows from the complete class theorems (Wald 1947, Brown 1981) that say for any given pair of decisions and cost functions, there exist some prior beliefs that render the decisions Bayes optimal. This means that there is an implicit duality or degeneracy when dealing separately with prior beliefs and cost functions. In one sense, Active Inference resolves this degeneracy by absorbing utility or cost functions into prior beliefs in the form of preferences.

10.7.2 Reinforcement Learning

Reinforcement learning (RL) is an approach to solving Markov decision problems that is popular in both artificial intelligence and the cognitive sciences (Sutton and Barto 1998). It focuses on how agents learn a policy (e.g., pole balancing strategy) by trial and error: by trying out actions (e.g., move to the left) and receiving positive or negative reinforcements, depending on action success (e.g., pole balanced) or failure (e.g., pole fallen).

Active Inference and RL address overlapping sets of problems but differ in many respects mathematically and conceptually. As noted above, Active Inference dispenses with the notions of reward, value functions, and Bellman optimality that are key to reinforcement learning approaches. Furthermore, the notion of policy is used differently in the two frameworks. In RL a policy denotes a set of stimulus-response mappings that need to be learned. In Active Inference, a policy is part of the generative model: it denotes a sequence of control states that need to be inferred.

Reinforcement learning approaches are plentiful, but they can be subdivided into three main families. The first two families try to learn good (state or state-action) value functions, albeit in different ways.

Model-free methods of RL learn value functions directly from experience: they perform actions, collect rewards, update their value functions, and use them to update their policies. The reason they are called model-free is that they do not use a (transition) model that permits predicting future states—of the sort used in Active Inference. Instead, they implicitly appeal to simpler kinds of models (e.g., state-action mappings). Learning value functions in model-free RL often involves computing reward prediction errors, as in the popular temporal-difference rule. While Active Inference often appeals to prediction errors, these are state prediction errors (as there is no notion of reward in Active Inference).
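
As a point of comparison, here is the temporal-difference rule in its simplest (TD(0)) form, applied to an invented three-state chain. Note that the update is driven by a reward prediction error and involves no transition model.

```python
import numpy as np

# Minimal model-free TD(0) sketch: the value function is updated directly from
# experience via a reward prediction error; no transition model is used anywhere.
# The three-state chain and parameters are illustrative assumptions.
V = np.zeros(3)                    # state values; state 2 is terminal
alpha, gamma = 0.1, 0.9

for episode in range(200):
    s = 0
    while s < 2:
        s_next = s + 1
        reward = 1.0 if s_next == 2 else 0.0        # reward on reaching the goal
        delta = reward + gamma * V[s_next] - V[s]   # reward prediction error
        V[s] += alpha * delta                       # temporal-difference update
        s = s_next
print(V.round(2))                  # values propagate backward from the reward
```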

Model-based methods of RL do not learn value functions or policies directly from experience. Rather, they learn a model of the task from experience, use the model to plan (simulate possible experiences), and update value functions and policies from these simulated experiences. While both Active Inference and reinforcement learning appeal to model-based planning, they use it differently. In Active Inference, planning is the computation of the expected free energy for each policy, not a means to update value functions. Arguably, if the expected free energy is seen as a value functional, it could be said that inferences drawn using the generative model are used to update this functional—offering a point of analogy between these approaches.

The third family of RL approaches, policy gradient methods, tries to optimize policies directly, without the intermediate value functions that are central to both model-based and model-free RL. These methods start from parameterized policies, able to generate (for example) movement trajectories, and then optimize them by changing the parameters to increase (decrease) the likelihood of a policy if the trajectory results in a high (low) reward. This approach relates policy gradient methods to Active Inference, which also dispenses with value functions (Millidge 2019). However, the general objective of policy gradients (maximizing long-term cumulative reward) differs from that of Active Inference.

Besides the formal differences between Active Inference and RL, there are also several important conceptual differences. One difference regards how the two approaches interpret goal-directed and habitual behavior. In the animal learning literature, goal-directed choices are mediated by the (prospective) knowledge of the contingency between an action and its outcome (Dickinson and Balleine 1990), whereas habitual choices are not prospective and depend on simpler (e.g., stimulus-response) mechanisms. A popular idea in RL is that goal-directed and habitual choices correspond to model-based and model-free RL, respectively, and that these are acquired in parallel and continuously compete to control behavior (Daw et al. 2005).

Active Inference instead maps goal-directed and habitual choices to different mechanisms. In Active Inference (in discrete time), policy selection is quintessentially model-based and hence fits the definition of goal-directed, deliberative choices. This is similar to what happens in model-based RL, but with a difference. In model-based RL, actions are selected in a prospective manner (using a model) but are controlled in a reactive way (using stimulus-response policies); in Active Inference, actions can be controlled in a proactive way—through fulfilling proprioceptive predictions (on action control, see section 10.6).

In Active Inference, habits can be acquired by executing goal-directed policies and then caching information about which policies are successful in which contexts. The cached information can be incorporated as a prior value of policies (Friston, FitzGerald et al. 2016; Maisto, Friston, and Pezzulo 2019). This mechanism permits executing policies that have a high prior value (in a given context) without deliberation. This can be thought of simply as observing “what I do” and learning that “I am the sort of creature that tends to do this” over multiple exposures to a task. In contrast to model-free RL, where habits are acquired independently of goal-directed policy selection, in Active Inference habits are acquired by repeatedly pursuing goal-directed policies (e.g., by caching their results).
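
A minimal sketch of this caching idea follows: Dirichlet-style counts over which policies were pursued in which contexts are converted into a prior over policies, which can dominate selection when deliberation is unavailable. The counts, contexts, and softmax combination are illustrative assumptions rather than the exact scheme of the papers cited above.

```python
import numpy as np

# Habit learning as caching: accumulate (Dirichlet-style) counts over which
# policies have been pursued in which contexts, and convert them into a prior
# over policies that biases selection before any deliberation. Illustrative only.
n_contexts, n_policies = 2, 3
counts = np.ones((n_contexts, n_policies))        # flat initial habit

def select(context, neg_G, habit_weight=1.0):
    """Combine the habitual prior (from counts) with deliberation (-G) via softmax."""
    log_E = np.log(counts[context] / counts[context].sum())
    logits = habit_weight * log_E + neg_G
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Repeatedly pursuing a goal-directed policy in context 0 entrenches it:
for _ in range(20):
    counts[0, 1] += 1                              # policy 1 kept succeeding here
print(select(0, neg_G=np.zeros(3)).round(2))       # habit dominates absent deliberation
```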

In Active Inference, goal-directed and habitual mechanisms can cooperate rather than only compete. This is because the prior belief over policies depends on both a habitual term (a prior value of policies) and a deliberative term (expected free energy). Hierarchical elaborations of Active Inference suggest that reactive and goal-directed mechanisms could be arranged in a hierarchy rather than as parallel pathways (Pezzulo, Rigoli, and Friston 2015).

Finally, it is worth noting that Active Inference and RL differ subtly in how they conceive behavior and its causes. RL originates from behaviorist theory and the idea that behavior results from trial-and-error learning mediated by reinforcement. Active Inference assumes instead that behavior is the result of an inference. This leads us to the next point.

10.7.3 Planning as Inference

In the same way that it is possible to cast perceptual problems as problems of inference, it is also possible to cast control problems in terms of (approximate) Bayesian inference (Todorov 2008). In keeping with this, in Active Inference, planning is seen as an inferential process: the inference of a sequence of control states of the generative model.

This idea is closely related to other approaches, which include control-as-inference (Rawlik et al. 2013, Levine 2018), planning-as-inference (Attias 2003, Botvinick and Toussaint 2012), and risk-sensitive and KL control (Kappen et al. 2012). In these approaches, planning proceeds through inferring a posterior distribution over actions, or sequences of actions, using a dynamic generative model that encodes probabilistic contingencies between states, actions, and future (expected) states. The best action or plan can be inferred by conditioning the generative model on observing future rewards (Pezzulo and Rigoli 2011, Solway and Botvinick 2012) or optimal future trajectories (Levine 2018). For example, it is possible to clamp (i.e., fix the value of) the future desired state in the model and then infer the sequence of actions that is most likely to fill the gap from the current state to the future desired state.
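
The clamping idea can be illustrated by brute force on a toy chain: fix the desired final state and score each candidate action sequence by the probability that it reaches that state. The transition model and horizon below are invented, and real schemes use variational inference rather than exhaustive enumeration.

```python
import numpy as np
from itertools import product

# Planning-as-inference sketch: clamp the desired future state, then infer which
# action sequence most plausibly "fills the gap" from the current state.
# We brute-force P(s_T = goal | actions) over short sequences in a toy
# 4-state chain; the transition model B is an invented assumption.
n_states, horizon = 4, 3
B = {  # B[a][s_next, s] = P(s_next | s, a)
    "stay":  np.eye(n_states),
    "right": np.roll(np.eye(n_states), 1, axis=0),
}
B["right"][:, -1] = 0; B["right"][-1, -1] = 1      # the last state absorbs

def p_goal(actions, start=0, goal=3):
    q = np.zeros(n_states); q[start] = 1.0
    for a in actions:
        q = B[a] @ q                                # predict forward in time
    return q[goal]                                  # probability of the clamped state

seqs = list(product(B, repeat=horizon))
posterior = np.array([p_goal(s) for s in seqs])
posterior /= posterior.sum()                        # posterior over action sequences
best = seqs[posterior.argmax()]
print(best, posterior.max().round(2))               # ('right', 'right', 'right')
```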

Active Inference, planning-as-inference, and other related schemes use a prospective form of control, which starts from an explicit representation of future, to-be-observed states rather than from a set of stimulus-response rules or policies, as is more typically done in optimal control theory and RL. However, the specific implementations of control- and planning-as-inference vary along at least three dimensions—namely, what form of inference they use (e.g., sampling or variational inference), what they infer (e.g., a posterior distribution over actions or action sequences), and the goal of inference (e.g., maximizing the marginal likelihood of an optimality condition or the probability of getting reward).

Active Inference takes a unique perspective on each of these dimensions. First, it uses a scalable approximate scheme—variational inference—to solve the challenging computational problems that arise during planning-as-inference. Second, it affords model-based planning, or the inference of a posterior over control states—which correspond to action sequences or policies, not single actions. Third, to infer action sequences, Active Inference considers the expected free energy functional, which mathematically subsumes other widely used planning-as-inference schemes (e.g., KL control) and can handle ambiguous situations (Friston, Rigoli et al. 2015).

10.8 Behavior and Bounded Rationality

The wise are instructed by reason, average minds by experience, the stupid by necessity and the brute by instinct.

—Marcus Tullius Cicero

Behavior in Active Inference automatically combines multiple components: deliberative, perseverative, and habitual (Parr 2020). Imagine a person who is walking to a shop close to her house. If she predicts the consequences of her actions (e.g., turning left or right), she can formulate a good plan to reach the shop. This deliberative aspect of behavior is provided by expected free energy, which is minimized when one acts in a way that achieves preferred observations (e.g., being in the shop). Note that expected free energy also includes a drive to reduce uncertainty, which can manifest in deliberation. For example, if the person is unsure about the best direction, she can move to an appropriate vantage point, from which she can find the way to the shop easily, even if this implies a longer route. In short, her plans acquire epistemic affordance.

If the person is less able to engage in deliberation (e.g., because she is distracted), she may continue walking after reaching the shop. This perseverative aspect of behavior is provided by variational free energy, which is minimized when one gathers observations that are compatible with current beliefs, including beliefs about the current course of actions. The sensory and proprioceptive observations that the person gathers provide evidence for “walking” and hence may determine perseveration in the absence of deliberation.

Finally, another thing the person could do—when she is less able to deliberate—is select the usual plan to go home, without thinking about it. This habitual component is provided by the prior value of policies. This could allocate high probability to a plan to go home—a plan she has observed herself enacting multiple times in the past—and can become dominant if not superseded by deliberation.

Note that deliberative, perseverative, and habitual aspects of behavior coexist and can be combined in Active Inference. In other words, one can infer that, in this situation, a habit is the most likely course of action. This is different from "dual theories," which assume that we are driven by two separate systems, one rational and one intuitive (Kahneman 2017). The mixture of deliberative, perseverative, and habitual aspects of behavior plausibly depends on contextual conditions, such as the amount of experience and the amount of cognitive resources one can invest in deliberative processes that may have a high complexity cost.

The impact of cognitive resources on decision-making has been widely studied under the rubric of bounded rationality (Simon 1990). The core idea is that while an ideal rational agent should always fully consider the outcomes of its actions, a bounded rational agent has to balance the costs, effort, and timeliness of computation—for example, the information-processing costs of deliberating the best plan (Todorov 2009, Gershman et al. 2015).

10.8.1 Free Energy Theory of Bounded Rationality

Bounded rationality has been cast in terms of Helmholtz free energy minimization: a thermodynamic construct that is strictly related to the notion of variational free energy as used in Active Inference; see Gottwald and Braun (2020) for details. The “free energy theory of bounded rationality” formulates the trade-offs of action selection with limited information-processing capabilities in terms of two components of free energy: energy and entropy (see chapter 2). The former represents the expected value of a choice (an accuracy term), and the latter represents the costs of deliberation (a complexity term). What is costly during deliberation is decreasing the entropy (or complexity) of one’s beliefs before a choice to render them more precise (Ortega and Braun 2013, Zénon et al. 2019). Intuitively, the choice would be more accurate (and potentially entail higher utility) with a more precise posterior belief, but because increasing the precision of beliefs has a cost, a bounded decision-maker has to find a compromise—by minimizing free energy. The same trade-offs emerge in Active Inference, thus producing forms of bounded rationality. The notion of bounded rationality also resonates with the use of a variational bound on evidence (or marginal likelihood) that is a definitive aspect of Active Inference. In sum, Active Inference provides a model of (bounded) rationality and optimality, where the best solution to a given problem results from the compromise between complementary objectives: accuracy and complexity. These objectives stem from a normative (free energy minimization) imperative that is richer than classical objectives (e.g., utility maximization) usually considered in economic theory.
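
A small numerical sketch of this trade-off: as a decision-maker invests more effort (here, an inverse temperature beta) in sharpening its choice distribution, expected utility (accuracy) rises, but so does the divergence from its default policy (complexity). The utilities and default policy are toy assumptions.

```python
import numpy as np

# Bounded rationality sketch: sharpening one's posterior before a choice buys
# accuracy (expected utility) but costs complexity (KL divergence from the prior).
# Free energy balances the two. Utilities and temperatures are toy values.
def softmax(x, beta):
    e = np.exp(beta * (x - x.max()))
    return e / e.sum()

utility = np.array([1.0, 0.5, 0.0])               # three options (illustrative)
prior = np.ones(3) / 3                            # default, effortless policy

for beta in [0.0, 1.0, 5.0, 50.0]:                # deliberative effort (precision)
    q = softmax(utility, beta)
    accuracy = q @ utility                        # expected utility of the choice
    complexity = (q * np.log(q / prior)).sum()    # cost of deviating from the prior
    print(f"beta={beta:5.1f}  accuracy={accuracy:.2f}  complexity={complexity:.2f}")
```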

10.9 Valence, Emotion, and Motivation

Consider your origins: you were not made to live as brutes, but to follow virtue and knowledge.

—Dante Alighieri

Active Inference focuses on (negative) free energy as a measure of fitness and the capacity of an organism to realize its goals. While Active Inference proposes that creatures act to minimize their free energy, this does not mean that they ever have to compute it. Generally, it is sufficient to deal with the gradients of the free energy. By analogy, we do not need to know our altitude to find the top of a hill but can simply follow the slope upward. However, some have suggested creatures may model how their free energy changes over time. Proponents of this hypothesis suggest that it might permit characterizations of phenomena like valence, emotion, and motivation.

On this view, it has been proposed that emotional valence, or the positive or negative character of emotions, can be conceived as the rate of change (first time-derivative) of free energy over time (Joffily and Coricelli 2013). Specifically, when a creature experiences an increase in its free energy over time, it may assign a negative valence to the situation, whereas when it experiences a decrease in its free energy over time, it may assign it a positive valence. Extending this line of thought to long-term dynamics of free energy (and second time-derivatives), it may be possible to characterize more sophisticated emotional states: for example, the relief of passing from a phase of low valence to a phase of high valence, or the disappointment of passing from a phase of high valence to a phase of low valence. Monitoring free energy dynamics (and the emotional states they elicit) may permit adapting behavioral strategies or learning rates to long-term environmental statistics.
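
This proposal is easy to caricature in code: assign valence from the sign of the first time-derivative of a free energy trace, with second derivatives indexing transitions such as relief and disappointment. The trace below is invented purely for illustration.

```python
import numpy as np

# Valence as the rate of change of free energy (Joffily and Coricelli 2013):
# falling free energy -> positive valence; rising -> negative valence.
F = np.array([3.0, 2.4, 1.9, 2.5, 3.2, 2.0, 1.1])   # invented free energy trace
dF = np.diff(F)                                     # first time-derivative
valence = np.where(dF < 0, "positive", "negative")
for t, (d, v) in enumerate(zip(dF, valence), start=1):
    print(f"t={t}: dF={d:+.1f} -> {v} valence")
# Second derivatives (np.diff(F, n=2)) would index transitions such as
# relief (valence turning positive) or disappointment (turning negative).
```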

It may seem a bit of a leap to assume a second generative model whose role is to monitor the free energy of the first. However, there is another way in which these ideas can be interpreted. An interesting formalization of these perspectives rests on thinking about what causes rapid changes in free energy. As it is a functional of beliefs, a rapid change in free energy must be due to fast belief updating. The key determinant of this speed is precision, which acts as a time constant in the dynamics of predictive coding. Interestingly, this ties in with the notion of higher derivatives of the free energy, as precision is the negative of the second derivative (i.e., the curvature of a free energy landscape). However, this begs the question as to why we should associate precision with valence. The answer comes from noticing that precision is inversely related to ambiguity. The more precise something is, the less ambiguous its interpretation. Choosing a course of action that minimizes expected free energy also means minimizing ambiguity and therefore maximizing precision. Here we see a direct association between higher-order derivatives of the free energy, its rate of change, and motivated behavior.

Expectations about (increases or decreases of) free energy may play motivational roles and incentivize behavior, too. In Active Inference, a surrogate expectation about changes (increases or decreases) of free energy is the precision of beliefs about policies. This again highlights the importance of this second-order statistic. For example, a highly precise belief signals that one has found a good policy—that is, a policy that can be confidently expected to minimize free energy. Interestingly, the precision of (beliefs about) policies has been linked to dopamine signaling (FitzGerald, Dolan, and Friston 2015). From this perspective, stimuli that increase the precision of beliefs about policies trigger dopamine bursts—which may indicate their incentive salience (Berridge 2007). This perspective may help shed light on the neurophysiological mechanisms linking expectations of goal or reward achievement to increases in attention (Anderson et al. 2011) and motivation (Berridge and Kringelbach 2011).

10.10 Homeostasis, Allostasis, and Interoceptive Processing

There is more wisdom in your body than in your deepest philosophy.

—Friedrich Nietzsche

A creature’s generative model is not just about the external world but also— and perhaps even more importantly—about the internal milieu. A generative model of a body’s inside (or interoceptive schema) has a dual role: to explain how interoceptive (bodily) sensations are generated and to ensure the correct regulation of physiological parameters (Iodice et al. 2019), like body temperature or sugar levels in the blood. Cybernetic theories (touched on in section 10.6.2) assume that a central objective of living organisms is maintaining homeostasis (Cannon 1929)—ensuring that physiological parameters remain within viable ranges (e.g., body temperature never becomes too high)—and that homeostasis can only be achieved by exerting a successful control over the environment (Ashby 1952).

This form of homeostatic regulation can be achieved in Active Inference by specifying the viable ranges of physiological parameters as priors over interoceptive observations. Interestingly, homeostatic regulation can be achieved in multiple, nested ways. The simplest regulatory loop is the engagement of autonomic reflexes (e.g., vasodilation) when certain parameters are (expected to be) out of range—for example, when body temperature is too high. This autonomic control can be construed as interoceptive inference: an Active Inference process that operates on interoceptive streams rather than proprioceptive streams, as in the case of externally directed actions (Seth et al. 2012, Seth and Friston 2016, Allen et al. 2019). For this, the brain may use a generative model that predicts interoceptive and physiological streams and triggers autonomic reflexes to correct interoceptive prediction errors (e.g., a surprisingly high body temperature). This is analogous to the way motor reflexes are activated to correct proprioceptive prediction errors and steer externally directed actions.
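
The simplest such loop can be sketched in a few lines: a tight prior over an interoceptive observation encodes the setpoint, and an autonomic reflex acts to cancel the ensuing interoceptive prediction error. All quantities below (temperatures, gain, drift) are invented for illustration.

```python
# Minimal homeostatic loop as interoceptive inference: a prior over an
# interoceptive observation encodes the viable range (setpoint), and an
# autonomic "reflex" acts to cancel the interoceptive prediction error.
setpoint = 37.0          # prior expectation over body temperature (deg C)
temperature = 39.0       # current (surprisingly high) reading
reflex_gain = 0.3        # how strongly, e.g., vasodilation corrects the error
drift = 0.1              # exogenous heating per time step

for t in range(10):
    prediction_error = temperature - setpoint           # interoceptive prediction error
    autonomic_action = -reflex_gain * prediction_error  # e.g., vasodilation
    temperature += autonomic_action + drift
    print(f"t={t}: T={temperature:.2f}")
# The error is quenched by acting on physiology, not by revising the setpoint;
# allostasis would instead adjust the prior ahead of anticipated demands.
```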

Active Inference extends beyond simple autonomic loops: it can correct the same interoceptive prediction error (high body temperature) in increasingly sophisticated ways (Pezzulo, Rigoli, and Friston 2015). It can use predictive, allostatic strategies (Sterling 2012, Barrett and Simmons 2015, Corcoran et al. 2020) that go beyond homeostasis and preemptively control physiology in an allostatic fashion before interoceptive prediction errors are triggered—for example, finding shade before overheating. Another predictive strategy entails mobilizing resources before expected excursions from physiological setpoints—for example, increasing cardiac output before a long run in anticipation of increased oxygen demands. That requires modifying the priors over interoceptive observations dynamically, going beyond homeostasis (Tschantz et al. 2021). Eventually, predictive brains can develop sophisticated goal-directed strategies, such as ensuring that one brings cold water to the beach, meeting the same imperative (controlling body temperature) in richer and more effective ways.

Biological and interoceptive regulation may be crucial for affect and emotional processing (Barrett 2017). During situated interactions, the brain’s generative model constantly predicts not just what will happen next but also what the consequences for interoception and allostasis are. Interoceptive streams—elicited during the perception of external objects and events— imbue them with an affective dimension, which signals how good or bad they are for the creature’s allostasis and survival, hence making them “meaningful.” If this view is correct, then disorders of this interoceptive and allostatic processing may engender emotional dysregulation and various psychopathological conditions (Pezzulo 2013; Barrett et al. 2016; Maisto, Barca et al. 2019; Pezzulo, Maisto et al. 2019).

There is an emerging bedfellow for interoceptive inference—namely, emotional inference. In this application of Active Inference, emotions are considered part of the generative model: they are just another construct or hypothesis that the brain employs to deploy precision in deep generative models. From the perspective of belief updating, this means anxiety is just a commitment to the Bayesian belief "I am anxious" that best explains the prevailing sensory and interoceptive cues. From the perspective of acting, the ensuing (interoceptive) predictions augment or attenuate various precisions (i.e., covert action) or enslave autonomic responses (i.e., overt action). This may look much like arousal, which confirms the hypothesis that "I am anxious." Usually, emotional inference entails belief updating that is domain general, assimilating information from both interoceptive and exteroceptive sensory streams—hence the intimate relationship between emotion, interoception, and attention in health (Seth and Friston 2016; Smith, Lane et al. 2019; Smith, Parr, and Friston 2019) and disease (Peters et al. 2017, J. E. Clark et al. 2018).

10.11 Attention, Salience, and Epistemic Dynamics

True ignorance is not the absence of knowledge, but the refusal to acquire it.

—Karl Popper

Given the number of times we have referred to precision and expected free energy in this chapter alone, it would be negligent not to devote a little space to attention and salience. These concepts recur throughout psychology, having been subject to numerous redefinitions and classifications. Sometimes these terms are used to refer to synaptic gain control mechanisms (Hillyard et al. 1998), which preferentially select some sensory modality or subset of channels within a modality. Sometimes they refer to how we orient ourselves, through overt or covert action, to gain more information about the world (Rizzolatti et al. 1987; Sheliga et al. 1994, 1995).

Although the uncertainty afforded by the many meanings of attention underwrites some of the epistemic attractiveness of this field of study, there is also value in resolving the attendant ambiguity. One of the things offered by a formal perspective on psychology is that we do not need to worry about this ambiguity. We can operationally define attention as the precision associated with some sensory input. This neatly maps to the concept of gain control, as sensations we infer to be more precise will have greater influence over belief updating than those inferred to be imprecise. The construct validity of this association has been demonstrated in relation to psychological paradigms, including the famous Posner paradigm (Feldman and Friston 2010). Specifically, responding to a stimulus at a location in visual space that is afforded a higher precision is faster than responding to stimuli in other locations.
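
This mapping from precision to gain control can be illustrated directly. In the sketch below (all values are toy assumptions), the same sensory evidence drives a belief to a response threshold in fewer steps when afforded higher precision, a caricature of the Posner result.

```python
# Attention as precision (gain control): the same sensory evidence updates
# beliefs faster when it is afforded higher precision, echoing the Posner
# finding that validly cued locations yield faster responses. Toy values.
def steps_to_threshold(pi, o=1.0, lr=0.1, threshold=0.9):
    mu = 0.0
    for step in range(1, 1000):
        mu += lr * pi * (o - mu)        # precision-weighted prediction error
        if mu >= threshold:
            return step
    return None

print("attended (pi=2.0):  ", steps_to_threshold(pi=2.0), "steps")
print("unattended (pi=0.5):", steps_to_threshold(pi=0.5), "steps")
```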

This leaves the term salience in want of a similar formal definition. Typically, in Active Inference, we associate salience with expected information gain (or epistemic value): a component of the expected free energy. Intuitively, something is more salient when we expect it to yield more information. However, this defines salience in terms of an action or policy, while attention is an attribute of beliefs about sensory input. This fits with the notion of salience as overt or covert orienting. We saw in chapter 7 that we could further subdivide expected information gain into salience and novelty. The former is the potential to infer, while the latter is the potential to learn. An analogy that expresses the difference between attention and salience (or novelty) is the design and analysis of a scientific experiment. Attention is the process of selecting the highest quality data from what we have already measured and using these to inform our hypothesis testing. Salience is the design of the next experiment to ensure the highest quality data.

We do not dwell on this issue to simply add another reclassification of attentional phenomena to the literature but to highlight an important advantage in committing to a formal psychology. Under Active Inference, it does not matter if others define attention (or any other construct) differently—as we can simply refer to the mathematical constructs in question and preclude any confusion. A final point of consideration is that these definitions offer a simple explanation for why attention and salience are so often conflated. Highly precise data are minimally ambiguous. This means that they should be afforded attention and that actions to acquire these data are highly salient (Parr and Friston 2019a).

10.12 Rule Learning, Causal Inference, and Fast Generalization

Yesterday I was clever, so I wanted to change the world. Today I am wise, so I am changing myself.

—Rumi

Humans and other animals excel at making sophisticated causal inferences, learning abstract concepts and the causal relationships between objects, and generalizing from limited experience—in contrast to current machine learning paradigms, which require a large number of examples to attain similar performance. This difference suggests that current machine learning approaches, which are largely based on sophisticated pattern recognition, may not fully capture the ways humans learn and think (Lake et al. 2017).

The learning paradigm of Active Inference is based on the development of generative models that capture the causal relations between actions, events, and observations. In this book, we have considered relatively simple tasks (e.g., the T-maze example of chapter 7) that require unsophisticated generative models. In contrast, understanding and reasoning about complex situations require deep generative models that capture the latent structure of the environment—such as hidden regularities that permit generalizing across a number of apparently dissimilar situations (Tervo et al. 2016; Friston, Lin et al. 2017).

One simple example of a hidden rule that governs sophisticated social interactions is a traffic intersection. Imagine a naive person who observes a busy crossroad and has to predict (or explain) on which occasions pedestrians or cars cross the road. The person can accumulate statistics about the co-occurrence of events (e.g., a red car stopping and a tall man crossing; an old woman stopping and a big car passing), but most of these statistics are ultimately useless. The person can eventually discover some recurrent statistical patterns, such as that pedestrians cross the road soon after all cars stop at a certain point on the road. This determination would be deemed sufficient in a machine learning setting if the task were just to predict when pedestrians are about to walk, but it would not entail any understanding of the situation. In fact, it may even lead to the erroneous conclusion that the stopping of cars explains the movement of pedestrians. This sort of error is typical of machine learning applications that do not appeal to (causal) models—and so cannot distinguish whether the rain explains the wet grass or the wet grass explains the rain (Pearl and Mackenzie 2018).

On the other hand, inferring the correct hidden (e.g., traffic light) rule provides a deeper understanding of the causal structure of the situation (e.g., it is the traffic light that causes the cars to stop and the pedestrians to walk). The hidden rule not only affords better predictive power but also renders inference more parsimonious, as it can abstract away from most sensory details (e.g., the color of cars). In turn, this permits generalizing to other situations, such as different crossroads or cities, where most sensory details differ significantly—with the caveat that facing crossroads in some cities, like Rome, may require more than looking at traffic lights. Finally, learning about traffic light rules may also enable more efficient learning in novel situations—the development of what is called a “learning set” in psychology or a learning-to-learn ability in machine learning (Harlow 1949). When facing a crossroad where the traffic light is off, one cannot use the learned rule but may nevertheless expect that another, similar hidden rule is in play—and this expectation could help in understanding what the traffic police officer is doing.
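A toy simulation makes the point (this is our illustration, not an analysis from the literature): when a hidden common cause, the traffic light, drives both cars and pedestrians, the two observables correlate strongly, yet intervening on one of them reveals that neither causes the other.

```python
# A toy simulation (ours, not from the literature) of the hidden
# common cause. Cars stopping and pedestrians crossing correlate
# strongly under observation, but intervening on the cars breaks
# the correlation: the traffic light drives both.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

light = rng.random(n) < 0.5                  # hidden cause: red light for cars
cars_stop = light ^ (rng.random(n) < 0.05)   # 5% observation noise
peds_cross = light ^ (rng.random(n) < 0.05)

print("observational correlation:",
      round(np.corrcoef(cars_stop, peds_cross)[0, 1], 2))  # high

# Intervention: force cars to stop at random, ignoring the light.
cars_stop_do = rng.random(n) < 0.5
print("correlation under do(cars_stop):",
      round(np.corrcoef(cars_stop_do, peds_cross)[0, 1], 2))  # near zero
```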

As this simple example illustrates, learning rich generative models—of the latent structure of the environment (aka structure learning)—affords sophisticated forms of causal reasoning and generalization. Scaling up generative models to address these sophisticated situations is an ongoing objective in computational modeling and cognitive science (Tenenbaum et al. 2006; Kemp and Tenenbaum 2008). Interestingly, there is a tension between current machine learning trends—wherein the general idea is “the bigger, the better”—and the statistical approach of Active Inference, which emphasizes balancing the accuracy of a model against its complexity and favoring simpler models. Model reduction (and the pruning of unnecessary parameters) is not simply a way to avoid wasting resources—it is also an effective way to learn hidden rules, including during offline periods like sleep (Friston, Lin et al. 2017), perhaps manifesting in resting state activity (Pezzulo, Zorzi, and Corbetta 2020).
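The accuracy-complexity balance just mentioned has a compact standard form: variational free energy decomposes into complexity minus accuracy, so minimizing free energy automatically penalizes whatever model structure the data do not warrant:

```latex
F = \underbrace{D_{\mathrm{KL}}\big[Q(s) \,\|\, P(s)\big]}_{\text{complexity}}
  \; - \; \underbrace{\mathbb{E}_{Q(s)}\big[\ln P(o \mid s)\big]}_{\text{accuracy}}
```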

10.13 Active Inference and Other Fields: Open Directions

It has to start somewhere, it has to start sometime, what better place than here? What better time than now?

—Rage Against the Machine, “Guerrilla Radio”

In this book, we mainly focus on Active Inference models that address biological problems of survival and adaptation. Yet Active Inference can be applied in many other domains. In this last section, we briefly discuss two such domains: social and cultural dynamics, and machine learning and robotics. Addressing the former requires thinking about the ways in which multiple Active Inference agents interact and the emergent effects of such interactions. Addressing the latter requires understanding how Active Inference can be endowed with more effective learning (and inference) mechanisms to scale up to more complex problems—but in a way that is compatible with the basic assumptions of the theory. Both are interesting open directions for research.

10.13.1 Social and Cultural Dynamics

Many interesting aspects of our (human) cognition relate to social and cultural dynamics rather than individualistic perceptions, decisions, and actions (Veissiere et al. 2020). By definition, social dynamics require multiple Active Inference creatures that engage in physical interactions (e.g., joint actions, such as playing team sports) or more abstract interactions (e.g., elections or social networking). Simple demonstrations of inter-Active Inference between identical organisms have already produced interesting emergent phenomena, such as the self-organization of simple life forms that resist dispersion, the possibility of engaging in morphogenetic processes to acquire and restore a body form, and mutually coordinated prediction and turn taking (Friston 2013; Friston and Frith 2015a; Friston, Levin et al. 2015). Other simulations have addressed the ways in which creatures can extend their cognition to material artifacts and shape their cognitive niches (Bruineberg et al. 2018).

These simulations capture only a fraction of the complexity of our social and cultural dynamics, but they illustrate the potential of Active Inference to expand from a science of individuals to a science of societies—and how cognition extends beyond our skulls (Nave et al. 2020).

10.13.2 Machine Learning and Robotics

The generative modeling and variational inference methods discussed in this book are widely used in machine learning and robotics. In these fields, the emphasis is often on how to learn (connectionist) generative models—as opposed to how to use them for Active Inference, the focus of this book. This is interesting, as machine learning approaches are potentially useful for scaling up the complexity of the generative models and of the problems considered in this book—with the caveat that they may call on very different process theories of Active Inference.

While it is impossible to review here the vast literature on generative modeling in machine learning, we briefly mention some of the most popular models, from which many variants have been developed. Two early connectionist generative models, the Boltzmann machine (Ackley et al. 1985) and the Helmholtz machine (Dayan et al. 1995), provided paradigmatic examples of how to learn the internal representations of a neural network in an unsupervised way. The Helmholtz machine is especially related to the variational approach of Active Inference, as it uses separate recognition and generative networks to infer a distribution over hidden variables and to sample from that distribution to obtain fictive data. The early practical success of these methods was limited, but the subsequent possibility of stacking multiple (restricted) Boltzmann machines enabled the learning of multiple layers of internal representations and was one of the early successes of unsupervised deep neural networks (Hinton 2007).
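For reference, a restricted Boltzmann machine defines a joint distribution over visible units v and hidden units h through a bilinear energy function (bilinear because there are no within-layer connections); learning adjusts the weights W and biases b, c so that the marginal over v matches the data:

```latex
P(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad
E(v, h) = -\,v^{\top} W h \;-\; b^{\top} v \;-\; c^{\top} h
```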

Two recent examples of connectionist generative models, variational autoencoders or VAEs (Kingma and Welling 2014) and generative adversarial networks or GANs (Goodfellow et al. 2014), are widely used in machine learning applications, such as recognizing or generating pictures and videos. VAEs exemplify an elegant application of variational methods to learning in generative networks. Their learning objective, the evidence lower bound (ELBO), is mathematically equivalent to negative variational free energy: maximizing the one minimizes the other. This objective enables the learning of an accurate description of the data (i.e., it maximizes accuracy) but also favors internal representations that do not differ too much from their priors (i.e., it minimizes complexity). The latter term acts as a so-called regularizer, which helps the model generalize and avoid overfitting.
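A small numerical sketch (ours, under simplifying assumptions: one-dimensional Gaussian prior, likelihood, and approximate posterior) makes both terms of the ELBO, and its sign relationship to free energy, explicit:

```python
# A toy numerical sketch (not the book's code): the ELBO for a
# one-dimensional Gaussian model, split into its two terms:
# expected accuracy minus complexity (the KL regularizer).
# Maximizing the ELBO is the same as minimizing free energy.
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, q_mu, q_sd, lik_sd=0.5, n_samples=10_000):
    # Complexity: KL[ q(z) = N(q_mu, q_sd^2) || p(z) = N(0, 1) ], closed form.
    kl = np.log(1.0 / q_sd) + (q_sd**2 + q_mu**2) / 2.0 - 0.5
    # Accuracy: E_q[ log p(x | z) ] with p(x|z) = N(z, lik_sd^2), Monte Carlo.
    z = rng.normal(q_mu, q_sd, n_samples)
    log_lik = -0.5 * np.log(2 * np.pi * lik_sd**2) - (x - z) ** 2 / (2 * lik_sd**2)
    return log_lik.mean() - kl  # ELBO = accuracy - complexity = -free energy

x = 1.0
print(elbo(x, q_mu=0.8, q_sd=0.4))   # a good approximate posterior
print(elbo(x, q_mu=-1.0, q_sd=0.4))  # a poor one: lower ELBO, higher free energy
```

The first call uses an approximate posterior close to the exact one and yields the higher ELBO; worse approximate posteriors lower the ELBO, that is, raise the free energy.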

GANs follow a different approach: they combine two networks, a generative network and a discriminative network, which continuously compete during learning. The discriminative network learns to distinguish real data from the fictive data produced by the generative network. The generative network tries to generate fictive data that fool (i.e., are misclassified by) the discriminative network. The race between these two networks forces the generative network to improve its generative capabilities and produce high-fidelity fictive data—an ability that has been widely exploited to generate, for example, realistic images.
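Formally, this race is the minimax game introduced by Goodfellow et al. (2014), in which the discriminator D and the generator G optimize the same value function in opposite directions:

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\ln D(x)\big]
+ \mathbb{E}_{z \sim p(z)}\big[\ln\big(1 - D(G(z))\big)\big]
```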

The above generative models (and others) can be used for control tasks. For example, Ha and Eck (2017) have used a (sequence-to-sequence) VAE to learn to predict pencil strokes. By sampling from the internal representation of the VAE, the model can construct novel stroke-based drawings. Generative modeling approaches have been used to control robot movements, too. Some of these approaches use Active Inference (Pio-Lopez et al. 2016; Sancaktar et al. 2020; Ciria et al. 2021) or closely related ideas, but in a connectionist setting (Ahmadi and Tani 2019; Tani and White 2020).

One of the main challenges in this domain is that robot movements are high dimensional and require (learning) sophisticated generative models. One interesting aspect of Active Inference and related approaches is that the most important thing to be learned is a forward mapping between actions and sensory (e.g., visual and proprioceptive) feedback at the next time step. This forward mapping can be learned in various ways: by autonomous exploration, by demonstration, or even by direct interaction with a human—for example, a teacher (the experimenter) who guides the hands of the robot along a trajectory to the goal, hence scaffolding the acquisition of effective goal-directed actions (Yamashita and Tani 2008). The possibility of learning generative models in various ways greatly expands the scope of robot skills that can eventually be achieved. In turn, the possibility of developing more advanced (neuro-)robots using Active Inference could be important not just for technological but also for theoretical reasons. Indeed, some key aspects of Active Inference, such as adaptive agent-environment interactions, the integration of cognitive functions, and the importance of embodiment, are naturally addressed in robotic settings.
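As a minimal sketch of what learning such a forward mapping can look like (our toy example with a linear plant, not a published robot model), consider fitting s' = f(s, a) from transitions gathered by random exploration:

```python
# A minimal sketch (our illustration, not a published model) of
# learning a forward model s' = f(s, a): the mapping from current
# state and action to predicted sensory feedback at the next step.
import numpy as np

rng = np.random.default_rng(0)

# Unknown "plant": a linear system s' = A s + B a + noise.
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [0.5]])

# Collect transitions by random exploration ("motor babbling").
S, U, S_next = [], [], []
s = np.zeros((2, 1))
for _ in range(500):
    a = rng.normal(0, 1, (1, 1))
    s_next = A_true @ s + B_true @ a + rng.normal(0, 0.01, (2, 1))
    S.append(s.ravel()); U.append(a.ravel()); S_next.append(s_next.ravel())
    s = s_next

# Fit the forward mapping [A B] by least squares on (state, action) pairs.
X = np.hstack([np.array(S), np.array(U)])   # inputs: [s, a]
Y = np.array(S_next)                        # targets: s'
W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # Y ≈ X W
A_hat, B_hat = W[:2].T, W[2:].T
print("A error:", np.abs(A_hat - A_true).max())
print("B error:", np.abs(B_hat - B_true).max())
```

The same fit applies unchanged if the transitions come from demonstration or guided interaction rather than random exploration; only the data collection loop changes.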

10.14 Summary

Home is behind, the world ahead, and there are many paths to tread through shadows to the edge of night, until the stars are all alight.

—J. R. R. Tolkien, The Lord of the Rings

We started this book by asking whether it is possible to understand brain and behavior from first principles. We then introduced Active Inference as a candidate theory to meet this challenge. We hope that the reader has been convinced that the answer to our original question is yes. In this chapter, we considered the unified perspective that Active Inference offers on sentient behavior and what implications this theory has for familiar psychological constructs, such as perception, action selection, and emotion. This gave us the opportunity to revisit the concepts introduced throughout the book and to remind ourselves of the fascinating questions still open for future research. We hope this book provides a useful complement to related works on Active Inference, including on the one hand the philosophy (Hohwy 2013; Clark 2015) and on the other hand the physics (Friston 2019a).
