Most introductions to reinforcement learning start with the same pitch: an agent learns to maximise cumulative reward by interacting with an environment. The framing is optimisation. Find the policy that produces the highest return. Maximise the objective. Solve the MDP.
This framing is not wrong, but it is incomplete. It pulls your attention toward the machinery (the loss functions, the gradient updates, the convergence guarantees) and away from what RL is actually doing. At its core, reinforcement learning is not optimising a number. It is learning to make decisions. The optimisation is a by-product of making good decisions consistently, not the thing itself.
This is not even a novel reframing. It is closer to the original one. Bellman's work on dynamic programming was explicitly about sequential decision-making. Puterman's foundational textbook is titled Markov Decision Processes. The decision-making framing is what modern ML culture has drifted away from, not what it needs to invent.
How you frame RL changes how you design systems, how you choose where to apply it, and how you explain it to the people who need to trust it.
The optimisation lens and its blind spots
When we frame RL as optimisation, we inherit a set of assumptions that do not always serve us well.
The first is that there exists a single scalar quantity to maximise. In clean environments such as games, simulated benchmarks, and well-defined control tasks, this holds. But in real operational settings, the notion of a single reward signal is often a convenient fiction. A terminal planner is not maximising throughput. They are balancing throughput against equipment wear, safety margins, labour constraints, customer priorities, and a dozen other considerations that shift weight depending on context. Reducing this to a scalar reward is possible (reward engineering exists as a practice precisely because it is hard), but it flattens the actual decision problem into something it is not.
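To make the flattening concrete, here is a minimal sketch of what scalarising a decision problem looks like. The signal names, the weights, and the two contexts are all hypothetical, chosen only to illustrate the point; they are not taken from any real terminal.

```python
# Hypothetical signals and weights, for illustration only.
def scalar_reward(signals, weights):
    """Collapse several signals into one number with a weighted sum."""
    return sum(weights[name] * value for name, value in signals.items())

# One shift's outcome: containers moved, equipment wear, safety incidents.
signals = {"throughput": 120.0, "wear": 3.0, "safety_incidents": 1.0}

# The same outcome scored under two contexts with different priorities.
normal_weights = {"throughput": 1.0, "wear": -5.0, "safety_incidents": -50.0}
storm_weights = {"throughput": 0.2, "wear": -5.0, "safety_incidents": -500.0}

normal_score = scalar_reward(signals, normal_weights)
storm_score = scalar_reward(signals, storm_weights)
print(normal_score, storm_score)
```

The same shift scores positively under one set of weights and strongly negatively under the other. Nothing about the outcome changed, only the weighting, which is exactly the context that a single fixed scalar hides.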
The second assumption is that the goal is to find the best policy. Optimality is the aspiration. In practice, what you usually want is a policy that makes good decisions reliably, that degrades gracefully when the world shifts, and that can be understood well enough to be trusted. Framing RL as optimisation sets a bar that is simultaneously too high, because true optimality is rarely attainable or even verifiable, and too narrow, because it says nothing about reliability, robustness, or trust.
The third is subtler. When you think in terms of optimisation, you naturally compare RL against other optimisers such as mixed-integer programming, evolutionary strategies, and gradient-free search. And in that comparison, RL often looks inefficient. It needs millions of samples. It is sensitive to hyperparameters. It can be unstable. This comparison is valid if RL is just another optimiser. But if RL is something different, a framework for learning to decide, then the comparison misses the point entirely.
What changes when you think in terms of decisions
A decision is not the same thing as an optimisation step. A decision is a commitment made in context, under uncertainty, with consequences that unfold over time. Decisions are situated. They depend on what you know, what you do not know, and what you expect to learn. They are shaped by the history that led to the current moment and the future you are trying to navigate toward.
RL, when you strip away the mathematical machinery, is a framework for learning how to make these situated decisions well. Consider what an RL agent actually does at each timestep:
- It observes the current situation. Not the full state of the world, but just what is available to it. Partial observability is the norm, not the exception.
- It draws on past experience. Its policy is shaped by every trajectory it has seen, every consequence it has encountered. This is not memorisation. It is learned structure, compressed into parameters.
- It makes a choice. Not the provably optimal choice, but a choice that reflects a learned understanding of what tends to lead to good outcomes from situations like this one.
- It accepts the consequences. The world moves forward. New information arrives. The next decision is made in a new context shaped by the previous one.
This is not optimisation. This is decision-making. The value function is not an objective to be maximised. It is a learned estimate of how good a situation is. The policy is not an optimisation output. It is a decision rule, a compressed version of what the agent has learned about how to act.
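The four steps above can be sketched as a loop. This is a toy illustration under stated assumptions: `NoisyCorridor` is a hypothetical environment invented for this example, and `policy` is a hand-written stand-in for a learned decision rule, not a trained agent.

```python
import random


class NoisyCorridor:
    """Hypothetical toy environment: reach the right end of a 5-cell corridor.

    The agent never sees its true position, only a noisy reading of it.
    """

    def __init__(self, length=5, seed=0):
        self.length = length
        self.rng = random.Random(seed)
        self.position = 0

    def observe(self):
        # Partial observability: the reading is off by one 20% of the time.
        noise = self.rng.choice([-1, 1]) if self.rng.random() < 0.2 else 0
        return self.position + noise

    def step(self, action):
        # action is -1 (left) or +1 (right); the world moves forward.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1
        return reward, done


def policy(observation):
    """Stand-in for a learned decision rule: here, always head right."""
    return 1


env = NoisyCorridor()
history = []
done = False
while not done:
    obs = env.observe()              # observe what is available
    action = policy(obs)             # apply the learned rule and make a choice
    reward, done = env.step(action)  # accept the consequences
    history.append((obs, action, reward))
```

Note that nothing in the loop evaluates an objective. The objective only shaped how `policy` would have been trained; at decision time there is just an observation going in and a choice coming out.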
Algorithms as decision-makers
This reframing changes how you read RL algorithms. Each one represents a different philosophy about how to make decisions.
Value-based methods such as DQN work by learning to evaluate situations. They estimate the expected return from each state-action pair, then pick the action with the highest estimate. The decision comes from the evaluation. If you want an analogy (and it is only an analogy), this resembles how an experienced operator assesses a situation: not by running a formal optimisation, but by pattern-matching against learned experience to recognise what good looks like.
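A minimal sketch of "the decision comes from the evaluation", using tabular Q-learning rather than DQN itself (DQN replaces the table with a neural network and adds further machinery). The two states, two actions, and the step sizes are hypothetical.

```python
# Hypothetical 2-state, 2-action problem used only for illustration.
ACTIONS = ["wait", "act"]


def greedy_action(q_table, state):
    """The decision falls out of the evaluation: pick the highest estimate."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-learning step: move the estimate toward
    reward + discounted value of the best next action."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])


# Start with no opinion about any state-action pair.
q = {(s, a): 0.0 for s in ["idle", "busy"] for a in ACTIONS}

# One experienced transition: acting while idle paid off.
q_update(q, "idle", "act", reward=1.0, next_state="busy")

choice = greedy_action(q, "idle")
```

After a single rewarding experience, the estimate for acting while idle rises above the alternative, and the greedy decision in that state changes with it. The policy is never stored separately; it is read off the value estimates each time.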
Policy gradient methods such as REINFORCE learn a mapping from states to action probabilities directly; PPO and A2C belong to the same family, although in practice they also train a critic. These methods do not need to evaluate every option and pick the best one. The policy is parameterised and updated based on which actions led to better-than-expected outcomes. Mechanically, this is gradient ascent on expected return. But what it produces is a direct decision rule, a function from situations to actions, rather than a decision derived from value estimation.
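As a rough sketch of the mechanics, here is REINFORCE without a baseline on a hypothetical two-armed bandit, using a softmax policy over two preference parameters. The reward values, learning rate, and number of steps are assumptions made for the example.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(prefs, action, reward, lr=0.1):
    """REINFORCE update for a softmax policy:
    prefs += lr * reward * grad of log pi(action)."""
    probs = softmax(prefs)
    for k in range(len(prefs)):
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        prefs[k] += lr * reward * grad_log_pi


rng = random.Random(0)
prefs = [0.0, 0.0]        # the policy's parameters
arm_rewards = [0.0, 1.0]  # hypothetical: arm 1 is the better choice

for _ in range(500):
    probs = softmax(prefs)
    action = rng.choices([0, 1], weights=probs)[0]  # sample a decision
    reinforce_step(prefs, action, arm_rewards[action])

final_probs = softmax(prefs)
```

The policy starts indifferent and ends up strongly preferring the rewarded arm, without ever estimating a value for either one. What training leaves behind is just `prefs`, the decision rule.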
Actor-critic methods combine both approaches. The critic learns a value function. The actor learns a policy. The critic provides a baseline that reduces variance in the actor's gradient estimates, and the actor updates its decision rule based on whether outcomes exceeded or fell short of the critic's expectations. The two components serve different roles (estimation and action selection) and their interaction typically produces more stable learning than a pure policy gradient method achieves alone.
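The interaction can be sketched with a one-step advantage, the gap between what happened and what the critic expected. This is a minimal tabular illustration, not any particular library's implementation; the state names, starting values, and learning rates are hypothetical.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """Critic's verdict: did the outcome beat its expectation for this state?"""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def actor_critic_step(prefs, values, state, action, reward, next_state, done,
                      actor_lr=0.1, critic_lr=0.5):
    """Update both components from a single transition."""
    advantage = td_advantage(reward, values[state], values[next_state], done=done)
    values[state] += critic_lr * advantage  # critic: refine the estimate
    probs = softmax(prefs[state])
    for k in range(len(probs)):             # actor: shift the decision rule
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        prefs[state][k] += actor_lr * advantage * grad_log_pi
    return advantage


values = {"s0": 0.5, "end": 0.0}  # critic's current expectations
prefs = {"s0": [0.0, 0.0]}        # actor's current preferences
adv = actor_critic_step(prefs, values, "s0", action=1, reward=1.0,
                        next_state="end", done=True)
```

Here the critic expected 0.5 and the outcome was 1.0, so the advantage is positive: the critic raises its estimate for `s0` and the actor becomes more likely to repeat action 1 there. Had the outcome merely matched the expectation, the actor would not have moved at all.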
Model-based methods add planning. They learn or are given a model of environment dynamics and use it to simulate future trajectories before committing to an action. This is often the most computationally expensive approach at decision time, but it allows the agent to reason about consequences explicitly, which is useful when the stakes are high and the environment is structured enough to model.
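A minimal sketch of planning with a model, assuming the model is given, deterministic, and small enough to search exhaustively. The one-dimensional world, its rewards, and the horizon are hypothetical, and the sketch ignores terminal states for brevity.

```python
from itertools import product

ACTIONS = [-1, +1]


def model(state, action):
    """Hypothetical dynamics model: a line with a goal at 3 and a pit at -1.

    Returns (next_state, reward). Every other move costs a little.
    """
    next_state = state + action
    if next_state == 3:
        return next_state, 10.0
    if next_state == -1:
        return next_state, -10.0
    return next_state, -1.0


def plan(state, horizon=3):
    """Simulate every action sequence in the model and return the first
    action of the best one, together with its simulated return."""
    best_return, best_first = float("-inf"), None
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first, best_return


first_action, simulated_return = plan(0)
```

The agent commits only to the first action and would replan from wherever it lands. The cost is visible in the loop: the number of simulated sequences grows exponentially with the horizon, which is why practical planners sample or prune rather than enumerate.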
None of these are optimisers in the way that a linear programming solver is an optimiser. They are different strategies for making decisions under uncertainty, each with different trade-offs between speed, accuracy, sample efficiency, and robustness. Optimisation happens within them (gradient descent updates parameters, value iteration converges) but the optimisation serves the decision-making, not the other way around.
Why this matters in practice
This is not purely a philosophical exercise. In my experience, the framing you adopt has practical consequences for how you build and deploy systems. What follows is my perspective, not established research, but I have found these distinctions useful.
Problem selection. When you think of RL as optimisation, you ask: "Is this problem hard to optimise?" When you think of RL as decision-making, you ask: "Does this problem involve a sequence of decisions under uncertainty?" The second question identifies a much more useful set of problems. Some are easy to optimise but hard to decide in, because the decisions depend on unfolding context that no static optimiser can anticipate. Others are hard to optimise but do not actually involve decisions at all. They are combinatorial search problems better served by exact methods.
Reward design. Under the optimisation framing, reward design is about specifying the objective precisely. Under the decision-making framing, reward design is about giving the agent useful feedback on its decisions. These are related but distinct tasks. Neither framing escapes the fundamental difficulty: imprecise reward specifications can produce reward hacking, where agents exploit the gap between what you specified and what you meant. This is well documented, and it happens often enough that you should expect it rather than treat it as an edge case. The decision-making framing does not solve this problem, but it does shift your attention from "did I specify the objective correctly?" toward "does this feedback reliably distinguish good decisions from bad ones in practice?" This can surface misalignments earlier, when you are evaluating the agent's actual choices rather than auditing a reward function in the abstract.
Evaluation. Optimisation asks: how close is this policy to optimal? Decision-making asks: does this agent make good decisions? In my experience, the second question tends to be more tractable and more relevant in applied settings. You can show an agent's decisions to a domain expert and ask whether they are sensible. You can compare them against what a competent human would do. You can measure whether the agent degrades gracefully under distribution shift. None of this requires knowing what optimal looks like, which in most real problems you do not.
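One simple way to operationalise "does this agent make good decisions?" is to score its choices against what a domain expert would accept in the same situations. The sketch below is illustrative only: the threshold policy, the action names, and the reviewed cases are all hypothetical.

```python
def agreement_rate(policy, expert_cases):
    """Fraction of reviewed situations where the agent chose an action
    the expert considers acceptable."""
    hits = sum(1 for obs, acceptable in expert_cases if policy(obs) in acceptable)
    return hits / len(expert_cases)


def policy(queue_length):
    """Stand-in for a trained policy: a simple threshold rule."""
    return "add_crane" if queue_length > 10 else "hold"


# Each case: (observed queue length, set of actions the expert would accept).
expert_cases = [
    (3, {"hold"}),
    (12, {"add_crane"}),
    (10, {"hold", "add_crane"}),
    (25, {"add_crane", "divert"}),
    (11, {"hold"}),
]

rate = agreement_rate(policy, expert_cases)
disagreements = [obs for obs, ok in expert_cases if policy(obs) not in ok]
```

Allowing a set of acceptable actions per case, rather than a single correct one, reflects the point above: the expert is judging whether a decision is sensible, not whether it is optimal. The list of disagreements is usually more informative than the rate itself, because those are the cases worth discussing with the expert.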
Trust. This is more speculative, but I think the framing matters for how people engage with these systems. Telling an operator "this algorithm optimises your throughput" invites scrutiny of the objective, the model, and every assumption baked into the reward function. Telling them "this system has learned to make decisions based on thousands of operational scenarios" invites a different kind of engagement, one where the system is evaluated on its decisions, which is something operators already know how to do. I do not have controlled evidence for this, but it aligns with my experience deploying these systems: people trust what they can evaluate, and decisions are easier to evaluate than objective functions.
The optimisation is real, but it is not the point
I want to be precise here. I am not arguing that RL does not involve optimisation. It obviously does. Policies are updated to improve expected return. Value functions are fitted to minimise prediction error. The mathematical framework is optimisation through and through.
What I am arguing is that optimisation is the mechanism, not the purpose. The purpose is to produce an agent that makes good decisions. The optimisation is how you get there: the training process, the update rule, the convergence behaviour. But once the agent is trained, what you deploy is a learned policy that maps observations to actions. There is no objective function being evaluated at inference time.
This is true of any trained model. A classifier at inference time is also "just" computing a forward pass, not optimising anything. But the distinction carries more practical weight in RL because of what the outputs do. A supervised learning model makes predictions that describe the world. An RL agent makes decisions that change it. When your system's outputs are actions with real operational consequences, the gap between training-time optimisation and deployment-time decision-making is not just a technical detail. It shapes how you monitor, evaluate, and govern the system.
Closing thoughts
The most useful mental model I have found for RL is not "optimisation algorithm" but "framework for learning to decide." It changes which problems I reach for RL to solve, how I design reward signals, how I evaluate deployed systems, and, critically, how I talk about these systems with the people who operate alongside them.
Optimisation is a means, not an end. The end is good decisions, made consistently, in situations that are too dynamic, too uncertain, or too complex for static rules to handle well. RL gives you a principled way to learn those decisions from experience. That is its real contribution: not that it maximises a number, but that it learns to choose well.