The Guessing Game
In the previous post, we explored the physical reality of information through Maxwell’s Demon. We established that information is not abstract; it is physical. Reducing uncertainty (sorting the gas) requires energy, and erasing information releases heat.
But how does the Demon, or any intelligent observer, decide how to sort the world? If we are observing a system with billions of microscopic configurations but can only measure a few macroscopic facts (like the average energy), how do we construct an honest model of that system?
We need a strategy. We need to play a Guessing Game.
In this post, we move from the physics of the Demon to the mathematics of Maximum Entropy. We will define the rules of this game, exploring how we can build probability distributions that are “maximally honest” about what we don’t know, while strictly adhering to the constraints of what we do know.
Imagine a world that can be in many microscopic configurations, which we’ll call \(x\). You play a game: at each round, you must guess its configuration. Your score is your long-run accuracy.
To do well, you need a probability distribution \(p(x)\): how likely each configuration is.
Probabilities are meaningless in a vacuum. They only make sense relative to a set of possibilities: the space of configurations \(x\). Once we specify that space, a natural question is: if we watched this world forever, what fraction of time would it spend in each configuration? That long-run fraction is what we’d like \(p(x)\) to capture.
If we knew absolutely nothing else, one honest option would be to treat every configuration as equally likely. This is the Principle of Indifference: if no configuration is privileged by your information, treat them all equally. But in reality we almost always know something, e.g., some coarse summary of the world, like its average energy. Such summaries don’t tell you exactly which \(x\) occurred, but they do rule out many distributions. In other words, we know some macroscopic constraints on functions of \(x\), such as: \[\mathbb E_{x \sim p}[f_k(x)] = c_k.\]
A common textbook approach is to impose an energy constraint on the distribution. For now, we’ll keep things simple and work with a single constraint (we’ll generalise this framework later): \[\mathbb E_{x \sim p}[E(x)] = \bar E.\]
Now the problem becomes:
Choose a distribution \(p(x)\) that (1) satisfies the constraints we know, and (2) builds in as few extra assumptions as possible.
“As few extra assumptions as possible” is what the Maximum Entropy principle formalizes. Entropy is defined as the average surprise, or uncertainty, of a system: \[S[p] = \mathbb E_{x \sim p}[\operatorname{Surprise}(p(x))],\] the average surprise under the probability distribution \(p\), where the surprise of an outcome is \(-\log p(x)\). We’ll treat entropy more rigorously in the sections to follow. Broadly, entropy measures how “uninformative” or “unassuming” your distribution is:
- High entropy means many possibilities remain open; you’re not pretending to know too much.
- Low entropy means you’re committing strongly to particular configurations.
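The two bullets above can be checked in a few lines. This is a toy sketch: the four-outcome space and the probability values are made up for illustration.

```python
import math

def entropy(p):
    """Shannon entropy S[p] = -sum p(x) log p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # nothing assumed beyond the 4 outcomes
peaked  = [0.97, 0.01, 0.01, 0.01]   # strong commitment to one outcome

S_uniform = entropy(uniform)  # log 4: the maximum for 4 outcomes
S_peaked  = entropy(peaked)   # much smaller: many possibilities ruled out
```

The uniform distribution attains the maximum entropy \(\log 4\), while the peaked one commits strongly and scores far lower.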
Maximizing entropy subject to constraints means: among all distributions that fit what we know, pick the one that leaves us maximally uncertain, maximally honest about our ignorance. (1) You encode exactly the information you have (the constraints), and (2) you do not fabricate any structure you don’t know. These two halves reflect two different philosophies about reality: ontology (what the world is) versus epistemology (what we know about the world).
Winning the Game
Mathematically, MaxEnt is a constrained optimization problem: \[\max_{p} S[p]\quad\text{s.t.}\quad\mathbb E_p[f_k(x)] = c_k,\ \sum_x p(x)=1.\]
Using Lagrange multipliers, we convert this to an unconstrained problem (the Lagrangian): \[\max_{p}\; S[p] - \sum_k \lambda_k \left(\mathbb E_p[f_k(x)] - c_k\right) - \lambda_0\left(\sum_x p(x)-1\right).\]
For the special case where the only constraint is average energy, \[\mathbb E_p[E(x)] = \bar E,\] we get:\[\max_{p} \Big(S[p] - \beta \mathbb E_p[E(x)]\Big),\] where \(\beta\) is the Lagrange multiplier enforcing the energy constraint, and we drop \(\lambda_0\) because we can handle normalization explicitly.
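Solving this optimization gives the Boltzmann form \(p(x) \propto e^{-\beta E(x)}\), with \(\beta\) tuned so the constraint holds. Here is a minimal numeric sketch: the four energy levels and the target \(\bar E\) are made up, and \(\beta\) is found by bisection, using the fact that the mean energy decreases monotonically in \(\beta\).

```python
import math

def boltzmann(energies, beta):
    """MaxEnt solution under an average-energy constraint: p(x) ∝ exp(-beta * E(x))."""
    w = [math.exp(-beta * e) for e in energies]
    Z = sum(w)                       # partition function (normalization)
    return [wi / Z for wi in w]

def mean_energy(energies, p):
    return sum(e * pi for e, pi in zip(energies, p))

# Toy system: four configurations with made-up energy levels.
E = [0.0, 1.0, 2.0, 3.0]
E_bar = 1.2                          # the macroscopic constraint we impose

# Bisection on beta: mean energy is monotone decreasing in beta.
lo, hi = -50.0, 50.0
for _ in range(200):
    beta = 0.5 * (lo + hi)
    if mean_energy(E, boltzmann(E, beta)) > E_bar:
        lo = beta                    # mean too high: raise beta to cool it
    else:
        hi = beta
p = boltzmann(E, beta)               # now satisfies E_p[E] ≈ E_bar
```

Since \(\bar E = 1.2\) is below the uniform mean of 1.5, the solver lands on a positive \(\beta\), and the resulting distribution favors lower-energy configurations.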
The Real Game
The same principle can be written as minimization of the Free Energy \(F[p] = \mathbb E_p[E(x)] - T S[p]\): \[\min_p F[p].\] With \(T = 1/\beta\), minimizing \(F\) is (up to a factor of \(T\)) the same as maximizing \(S[p] - \beta\,\mathbb E_p[E(x)]\). This quantity, \(F = E - T S\) (Energy minus Temperature \(\times\) Entropy), sits at the heart of many pillars of science. It governs equilibrium in physics, inference in information theory, learning in biology, and even value in economics.
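We can check numerically that the Boltzmann distribution minimizes \(F\). In this sketch (energy levels and temperature are made-up toy values), we compare its free energy against a thousand random distributions:

```python
import math
import random

def free_energy(p, energies, T):
    """F[p] = E_p[E(x)] - T * S[p]."""
    E_avg = sum(e * pi for e, pi in zip(energies, p))
    S = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return E_avg - T * S

E = [0.0, 1.0, 2.0, 3.0]             # made-up energy levels
T = 1.5                              # temperature; beta = 1/T

# The Boltzmann distribution p(x) ∝ exp(-E(x)/T) ...
w = [math.exp(-e / T) for e in E]
p_star = [wi / sum(w) for wi in w]
F_star = free_energy(p_star, E, T)

# ... should beat every other normalized distribution on F.
random.seed(0)
F_others = []
for _ in range(1000):
    q = [random.random() for _ in E]
    q = [qi / sum(q) for qi in q]
    F_others.append(free_energy(q, E, T))
```

No random competitor gets below `F_star`: among all distributions, the MaxEnt/Boltzmann one is the free-energy minimizer.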
Let’s consider Maxwell’s Demon again.
Statistical Mechanics: Maxwell proposed a famous thought experiment:
- An enclosed container of gas, with atoms moving randomly.
- A partition separates chamber A and chamber B, connected by a tiny, frictionless door operated by a “Demon.”
- The Demon watches the molecules and records their speeds in its memory. It opens the door to let fast molecules into one chamber and slow molecules into the other.
This sorting creates a temperature gradient (Order). By segregating the gas, the Demon has lowered the entropy of the gas without changing its total energy. Symbolically: \[\Delta S_{\text{gas}} < 0.\] Because the entropy drops while the energy stays fixed, the Free Energy (the capacity to do work) rises: \[\Delta F_{\text{gas}} = - T \,\Delta S_{\text{gas}} > 0.\]
This gas is now a battery. We could run an engine on this gradient. Once the gas returns to equilibrium, the demon simply repeats the cycle: watch, sort, extract work. At first glance, this looks like a perpetual motion machine, which violates the Second Law of Thermodynamics.
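The bookkeeping above can be sketched in code. This is a toy calculation in units of \(k_B\): we track only the per-molecule “fast or slow?” uncertainty, and the molecule count and temperature are illustrative, not physical measurements.

```python
import math

# Per-molecule "is this molecule fast or slow?" uncertainty, in units of k_B.
# Before sorting: each molecule is fast or slow with probability 1/2.
S_before_per_molecule = -2 * 0.5 * math.log(0.5)   # = ln 2

# After sorting: the chamber a molecule sits in fixes its kind.
S_after_per_molecule = 0.0

N = 1e23                             # made-up number of molecules
delta_S_gas = N * (S_after_per_molecule - S_before_per_molecule)   # < 0

T = 300.0                            # kelvin (illustrative)
delta_F_gas = -T * delta_S_gas       # > 0: extractable work has increased
```

The sorted gas has strictly negative \(\Delta S_{\text{gas}}\) and therefore strictly positive \(\Delta F_{\text{gas}}\), which is exactly what makes it look like a perpetual motion machine.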
Information: The catch lies in the Demon’s “scratchpad.” To sort molecules, the demon must record information. If the demon had an infinite memory, it could indeed act as a perpetual motion machine. But in reality, memory is finite. To keep sorting indefinitely, the demon must eventually erase old information to make room for new measurements, resetting each recorded bit: \[\text{(fast/slow)} \longrightarrow 0.\] This erasure is a logically irreversible operation (two possible states collapse into one).
According to Landauer’s Principle, this operation has a physical cost. Erasure is not free. To reset the memory bits to 0:
- Work is Required: The demon (or the battery powering it) must perform work (import energy) on its physical memory system.
- Heat is Released: This work is dissipated as heat into the environment. You cannot erase information without generating heat.
Thus, at the moment of erasure, due to the dumped heat, the entropy (disorder) of the environment increases: \(\Delta S_{\text{env}} > 0\). This increase is strictly greater than (or at best equal to) the entropy reduction the demon achieved in the gas.
Thus, the entropy of the universe increases. This is also how physicists explain the arrow of time: the direction of time is the direction of increasing entropy.
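Landauer’s bound puts a number on the erasure cost: at least \(k_B T \ln 2\) of heat per bit. A quick sketch (the temperature and memory size are illustrative choices, not from the original thought experiment):

```python
import math

k_B = 1.380649e-23      # Boltzmann constant, J/K (exact in the 2019 SI)
T = 300.0               # room temperature, K (illustrative)

# Landauer's bound: erasing one bit dissipates at least k_B * T * ln 2 of heat.
landauer_per_bit = k_B * T * math.log(2)   # on the order of 1e-21 J

# Minimum heat for the demon to wipe, say, a gigabyte of molecule records.
bits = 8e9
min_heat = bits * landauer_per_bit
```

Per bit the cost is tiny, around \(3 \times 10^{-21}\) J at room temperature, but it is strictly positive, and that is all the Second Law needs.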
Information is Physical: Look at how the entropy of a physical system (the gas) is transformed into the entropy of information (the memory) and finally into heat (the environment).
Maxwell’s demon uses information about microscopic configurations to reduce the gas’s entropy and extract work. But the demon’s memory itself is a physical system. Each bit of ‘freedom’ removed from the gas costs work to erase later. Information and Energy meet in the same free-energy bookkeeping.
Machine Learning: In Artificial Intelligence, we use the same trade-off to learn from data without memorizing it. We want a model that represents the “Truth” (fits the data) but remains “Simple” (generalizes to new data).
- Energy (\(E\)): This is the Prediction Error. How badly does the model mismatch the data? We want to minimize this to be accurate.
- Entropy (\(S\)): This is the Regularization. We want to maximize this to prevent overfitting.
The “Perfect” model is the one that minimizes the Free Energy of the loss landscape, finding the sweet spot between accuracy (Cost) and generalization (Freedom).
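The trade-off can be sketched as free-energy-minimizing weights over candidate models: the distribution minimizing \(\mathbb E_p[\text{loss}] - T\,S[p]\) is a softmax over losses. The three models and their losses here are made up for illustration.

```python
import math

def model_posterior(losses, T):
    """Free-energy-minimizing weights over models: p_i ∝ exp(-loss_i / T)."""
    w = [math.exp(-l / T) for l in losses]
    Z = sum(w)
    return [wi / Z for wi in w]

# Made-up training losses for three candidate models.
losses = [0.10, 0.12, 0.90]

p_cold = model_posterior(losses, T=0.01)  # low T: commit hard to the best fit
p_warm = model_posterior(losses, T=1.0)   # higher T: hedge across models
```

At low temperature the weights collapse onto the lowest-loss model (pure accuracy); at higher temperature they spread out over all plausible models (regularization).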
Behavioral Economics: Finally, humans play this game when making decisions. We balance “Utility” (Energy) with “Exploration” (Entropy).
- Standard Economics (Zero Temperature): Assumes humans maximize Utility perfectly. We are rational robots who always pick the optimal option (\(\beta \to \infty\)).
- Bounded Rationality (Finite Temperature): In reality, humans operate at a finite temperature. We make mistakes, we explore, and we are noisy. This leads to the Softmax choice rule, which is exactly the Boltzmann distribution: \[P(\text{Choice}_i) \propto e^{\beta \cdot \text{Utility}_i}\]
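The two regimes above fall out of the same formula by turning the \(\beta\) knob. A small sketch with made-up utilities for three options:

```python
import math

def choice_probs(utilities, beta):
    """Boltzmann / softmax choice rule: P(choice i) ∝ exp(beta * utility_i)."""
    w = [math.exp(beta * u) for u in utilities]
    Z = sum(w)
    return [wi / Z for wi in w]

U = [1.0, 2.0, 3.0]                      # made-up utilities for three options

p_noisy    = choice_probs(U, beta=0.1)   # near-uniform: exploration dominates
p_rational = choice_probs(U, beta=20.0)  # near-deterministic: picks the best
```

Small \(\beta\) (high temperature) gives noisy, nearly indifferent choices; large \(\beta\) (low temperature) recovers the rational agent who almost always takes the highest-utility option.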
However, sometimes humans deviate from this entropy-based picture altogether. This is the domain of Prospect Theory. When we are afraid (Risk Averse) or gambling (Risk Seeking), we distort probabilities. We overweight rare events (like lottery wins) or ignore common risks.
Prateek Gupta