Reinforcement Learning in Container Terminals: When It Works, When It Doesn't, and Why That Matters

Container terminals are among the most operationally complex environments in global logistics. Thousands of containers arrive daily on vessels, trucks, and barges. Each one needs to be unloaded, stored, retrieved, and loaded onto the next mode of transport, all while respecting vessel stability constraints, equipment availability, yard capacity, and delivery deadlines that can shift by the hour.

When I first started working on these problems, the appeal of reinforcement learning felt obvious. RL is an excellent tool for sequential decision-making under uncertainty. Container terminals are exactly that: a long chain of interdependent decisions made with incomplete information in a constantly changing environment. But after several years of building and deploying RL systems in real European container terminals, I have developed a more nuanced view. RL is genuinely powerful here. It is also not always the right answer.

The problem landscape

To understand where RL fits, you first need to understand the operational problems a terminal faces. There is not one problem. There are many, and they are tightly coupled:

Container stowage planning: deciding where each container goes on a vessel. A single poor placement can cascade into dozens of unnecessary moves later. The search space is combinatorial: the number of possible assignments grows roughly factorially with the number of containers and slots.
Yard stacking: containers in the yard are stacked in blocks. Retrieving one from the bottom means moving everything above it. Good stacking decisions now prevent expensive reshuffles later.
Crane scheduling: quay cranes, yard cranes, and straddle carriers all need to be coordinated. Conflicts, deadlocks, and idle time are constant threats.
Equipment allocation: assigning the right number of vehicles, cranes, and personnel to each task, balancing throughput against operational cost.
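
To make the cost of a poor stacking decision concrete, here is a minimal Python sketch that counts how many containers have to be lifted off before a given one can be retrieved. The function name `blocking_moves` and the list-based stack (ordered bottom to top) are illustrative assumptions, not a description of any production system.

```python
def blocking_moves(stack, target):
    """Return how many containers sit above `target` in `stack`.

    `stack` is a list of container IDs ordered bottom to top, an
    assumed and deliberately simplified model of one yard stack.
    """
    if target not in stack:
        raise ValueError(f"{target!r} is not in this stack")
    position = stack.index(target)
    # Everything above the target has to be lifted off first.
    return len(stack) - position - 1


stack = ["C1", "C2", "C3", "C4"]  # C1 at the bottom, C4 on top
print(blocking_moves(stack, "C1"))  # 3: three containers must move first
print(blocking_moves(stack, "C4"))  # 0: already on top
```

A real terminal would also have to decide where the displaced containers go, which is where the problem stops being a simple count and becomes a planning problem.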

What makes these problems hard is not just their individual complexity. It is how they interact. A stowage decision affects yard operations. Crane scheduling depends on stacking layouts. Equipment allocation shapes how fast any of it moves. Classical operations research approaches such as mixed-integer programming, constraint programming, and heuristic search have been the workhorses of terminal optimisation for decades. They work. But they often struggle with the dynamic, uncertain nature of real operations, where vessel arrivals shift, trucks show up late, and equipment breaks down.

Where RL shines

RL's core strength is learning policies that handle sequential decision-making in environments where the state changes between decisions. In a container terminal, this is the default operating condition, not the exception.

Take stowage planning. A traditional optimiser might solve for an optimal plan given the current manifest, only to have that plan partially invalidated when a vessel arrives at a changed berth, or when a set of containers marked for loading gets delayed at the gate. An RL agent trained across thousands of simulated scenarios can learn a policy that is robust to these disruptions, making good-enough decisions quickly rather than perfect decisions that assume a world that no longer exists.

In our tests, we have seen RL consistently outperform static heuristics in situations with high variance: vessels with unusual container mixes, peaks in throughput demand, or novel yard configurations that rule-based systems were never explicitly designed for. The agent generalises. It does not need a new rule for every new situation. It has learned the underlying structure of what makes a decision good.

RL also handles the temporal credit assignment problem naturally. Placing a container in a particular bay slot has consequences that only materialise moves or even hours later. RL agents can learn to anticipate these long-horizon effects in a way that greedy heuristics fundamentally cannot.
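
One way to see how a delayed consequence is credited back to an earlier decision is the discounted return that most RL methods optimise. The sketch below is a generic textbook illustration; the reward values and the discount factor `gamma` are made up for the example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for every step of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: the placement at step 0 looks free, but it forces
# a costly reshuffle (reward -5) three moves later.
rewards = [0.0, 0.0, 0.0, -5.0]
print(discounted_returns(rewards, gamma=0.9))
```

The return at step 0 is negative even though its immediate reward is zero. That is the signal a greedy heuristic, which only looks at the immediate reward, never sees.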

Where RL falls short

Here is where honesty matters. RL is not the best tool for every problem in a terminal, and pretending otherwise is a fast way to lose credibility with operators who have been running these facilities for decades.

Hard constraint satisfaction. Container terminals operate under strict physical and regulatory constraints. Vessel stability limits are non-negotiable. A vessel that exceeds its stress limits does not sail. Dangerous goods segregation rules are legally binding. RL agents, by nature, optimise soft objectives. They can be penalised for constraint violations, but penalty shaping is fragile: set the penalty too low and the agent learns that a violation is sometimes worth it. A well-formulated mixed-integer program returns only solutions that satisfy every constraint, or reports that none exists. An RL policy cannot offer that guarantee, at least not without additional safeguards.
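
A toy illustration of why penalty shaping is fragile: with a soft penalty, whether a violating action wins depends entirely on the penalty weight. Everything in this sketch (the `shaped_reward` function, the two candidate placements, and the numbers) is hypothetical.

```python
def shaped_reward(moves_saved, overweight_tonnes, penalty_weight):
    """Soft-constraint reward: benefit minus a weighted violation penalty."""
    return moves_saved - penalty_weight * overweight_tonnes


def best_candidate(candidates, penalty_weight):
    """Return the name of the candidate with the highest shaped reward."""
    def score(name):
        c = candidates[name]
        return shaped_reward(
            c["moves_saved"], c["overweight_tonnes"], penalty_weight
        )
    return max(candidates, key=score)


# Two hypothetical placements: one respects the weight limit, one does not.
candidates = {
    "feasible": {"moves_saved": 2.0, "overweight_tonnes": 0.0},
    "violating": {"moves_saved": 6.0, "overweight_tonnes": 1.5},
}

print(best_candidate(candidates, penalty_weight=1.0))   # violating
print(best_candidate(candidates, penalty_weight=10.0))  # feasible
```

No single weight is safe across all situations: a weight that deters this violation may still be too small when the apparent benefit is larger.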

Sample efficiency. Training RL agents for terminal operations requires a simulator, and a good one. Building one that faithfully captures crane movements, container weights, vessel dynamics, and yard topology is a significant engineering effort. And even with a good simulator, RL training is sample-hungry: an agent typically needs a very large number of simulated episodes before its policy becomes useful. For problems where the action space is well-structured and the constraints are clear, an optimiser will find a good solution in seconds, with no training at all.

The sim-to-real gap. No simulator perfectly captures a real terminal. Equipment behaves differently under load. Human operators make decisions that no model predicts. Weather affects crane operations in ways that are hard to parameterise. Policies that perform well in simulation can degrade in production if the gap is not carefully managed through domain randomisation, conservative deployment strategies, and continuous monitoring.
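
Domain randomisation, in its simplest form, means drawing fresh simulator parameters for every training episode so the policy never overfits to one idealised terminal. The sketch below assumes a hypothetical `SimParams` container; the parameter names and ranges are invented for illustration and would need to be calibrated against real operational data.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical simulator settings that vary between episodes."""
    crane_cycle_seconds: float
    truck_delay_minutes: float
    breakdown_probability: float


def sample_sim_params(rng):
    """Draw one randomised parameter set (all ranges are assumptions)."""
    return SimParams(
        crane_cycle_seconds=rng.uniform(90.0, 150.0),
        # Exponential delays with an assumed mean of 15 minutes.
        truck_delay_minutes=rng.expovariate(1 / 15.0),
        breakdown_probability=rng.uniform(0.0, 0.05),
    )


rng = random.Random(0)  # seeded so training runs are reproducible
for episode in range(3):
    params = sample_sim_params(rng)
    print(episode, params)  # in practice: reset the simulator with params
```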

Interpretability. Terminal planners need to understand why a system is making a recommendation. "The neural network said so" is not an acceptable answer when a planner is responsible for a vessel carrying millions of euros in cargo. RL policies are opaque by default, and making them interpretable requires deliberate effort: attention mechanisms, decision logging, and counterfactual explanations. This adds significant engineering complexity.

When the problem is actually static. Some terminal problems are more static than they appear. If a crane scheduling problem can be fully specified upfront and the environment does not change during execution, a well-formulated optimisation model will almost certainly outperform RL. Not every problem needs a learned policy. Sometimes a solved model is just better.

The practitioner's middle ground

The most effective systems I have built are not pure RL and not pure optimisation. They are hybrids.

In practice, this often looks like RL handling the high-level sequential decisions (which containers to prioritise, how to allocate resources across competing tasks, when to deviate from a plan) while constraint programming or mathematical optimisation handles the low-level feasibility checks. This gives you the adaptability of learned policies with the guarantees of formal methods.
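
A minimal sketch of that division of labour: the learned policy only ranks actions, and a separate feasibility check has the final say. The `choose_action` function, the slot names, and the `forbidden` set standing in for a constraint solver are all hypothetical.

```python
def choose_action(scores, is_feasible):
    """Pick the highest-scoring action that passes the feasibility check.

    `scores` maps each action to the policy's preference for it.
    `is_feasible` is any callable that returns True for allowed actions.
    """
    feasible = {a: s for a, s in scores.items() if is_feasible(a)}
    if not feasible:
        return None  # nothing allowed: hand over to a fallback or a planner
    return max(feasible, key=feasible.get)


# Hypothetical policy preferences over three bay slots.
scores = {"slot_A": 0.9, "slot_B": 0.7, "slot_C": 0.2}

# Stand-in for a real constraint check (stability, segregation, and so on).
forbidden = {"slot_A"}

print(choose_action(scores, lambda a: a not in forbidden))  # slot_B
```

The policy's favourite action is rejected and the next best feasible one is used, so an infeasible action can never reach the equipment regardless of what the network outputs.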

Another pattern that works well: using RL to learn a value function that guides a search-based planner. Instead of the agent directly outputting actions, it learns to evaluate states. A tree search or beam search then uses those evaluations to plan ahead, respecting constraints at every step. This gets you the long-horizon reasoning of RL without the constraint-violation risks of direct policy deployment.
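
The value-guided search pattern can be sketched as a small beam search on a toy stacking problem. Here `value_fn` is a hand-written stand-in for a learned value function, and the state encoding, `MAX_HEIGHT`, and the container numbering (lower number leaves earlier) are assumptions made for the example.

```python
MAX_HEIGHT = 2  # assumed stack height limit


def actions_fn(state):
    """Return only feasible actions: stacks that still have room."""
    stacks, remaining = state
    if not remaining:
        return []
    return [i for i, s in enumerate(stacks) if len(s) < MAX_HEIGHT]


def step_fn(state, action):
    """Place the next incoming container on the chosen stack."""
    stacks, remaining = state
    new_stacks = list(stacks)
    new_stacks[action] = stacks[action] + (remaining[0],)
    return (tuple(new_stacks), remaining[1:])


def blocking_pairs(stacks):
    """Count containers sitting above one that leaves earlier."""
    count = 0
    for s in stacks:
        for lower in range(len(s)):
            for upper in range(lower + 1, len(s)):
                if s[upper] > s[lower]:
                    count += 1
    return count


def value_fn(state):
    """Stand-in for a learned critic: fewer blocking pairs is better."""
    return -blocking_pairs(state[0])


def beam_search(initial_state, depth, beam_width):
    """Keep the `beam_width` best states at each level of the search."""
    beam = [(initial_state, [])]
    for _ in range(depth):
        candidates = []
        for state, plan in beam:
            for action in actions_fn(state):  # infeasible moves never appear
                candidates.append((step_fn(state, action), plan + [action]))
        if not candidates:
            break
        candidates.sort(key=lambda item: value_fn(item[0]), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Two empty stacks; containers arrive in the order 3, 1, 2.
initial = (((), ()), (3, 1, 2))
final_state, plan = beam_search(initial, depth=3, beam_width=2)
print(plan, final_state[0], blocking_pairs(final_state[0]))
```

Because `actions_fn` only ever proposes moves that respect the height limit, the search cannot return an infeasible plan. The learned component influences only which feasible plan is preferred.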

Domain expertise is the glue that holds these hybrid systems together: knowing which sub-problems are dynamic enough to benefit from RL, which are static enough for direct optimisation, and where the interfaces between them should sit. That knowledge does not come from the ML literature. It comes from spending time in terminals, understanding the operational reality, and respecting the decades of domain knowledge that terminal operators bring to the table.

What I have learned

After deploying RL systems capable of handling hundreds of thousands of real container decisions daily, a few lessons have become clear:

Start with the problem, not the method. The question is never "can we use RL here?" It is "what is the best approach for this specific operational problem?" Sometimes that is RL. Often it is not. Frequently it is a combination.
Earn trust incrementally. Operators will not hand over control of a vessel to an algorithm on day one. Start with decision support. Let the system prove itself on low-stakes decisions before moving to higher-stakes ones. Trust is earned through demonstrated reliability, not through impressive demos.
Invest in the simulator. The quality of your RL system is bounded by the quality of your simulation environment. A mediocre agent trained in a great simulator will outperform a sophisticated agent trained in a poor one.
Constraints are the problem, not mere obstacles. In academic RL, constraints are often treated as penalties to be tuned. In production, constraints are the non-negotiable reality that defines the problem. Build your system around them, not despite them.
Monitor relentlessly. A deployed RL system needs continuous monitoring. Distribution shifts happen. Terminal configurations change. New vessel types arrive. If you are not watching, your policy is silently degrading.
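
As one very simple form of the monitoring described in the last lesson, the sketch below compares a recent window of an input feature against a reference window and raises an alert when the mean has moved too far. The feature, the sample values, and `ALERT_THRESHOLD` are assumptions, and a production system would likely use more robust drift tests across many features.

```python
import statistics


def drift_score(reference, recent):
    """Return how many reference standard deviations the recent mean moved."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    if ref_std == 0:
        raise ValueError("reference window has no variance")
    return abs(statistics.fmean(recent) - ref_mean) / ref_std


# Hypothetical feature: average stack height observed by the policy per shift.
reference = [3.1, 2.9, 3.0, 3.2, 2.8, 3.0]  # collected at deployment time
recent = [3.9, 4.1, 4.0]                    # collected this week

ALERT_THRESHOLD = 3.0  # assumed value, to be tuned per feature

score = drift_score(reference, recent)
if score > ALERT_THRESHOLD:
    print(f"drift alert: score={score:.1f}")
```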

Closing thoughts

The goal of applying RL to container terminals was never to prove that RL works everywhere. It was to find the right tool for each problem and to build systems that actually improve operations: measurably, reliably, and in a way that terminal operators trust.

RL is a powerful tool in the toolkit. For dynamic, sequential, high-variance operational problems, it offers something that classical methods struggle to match. But it is one tool among many, and knowing when not to use it is just as important as knowing when to reach for it.

The best systems are honest about what each component does well and where it falls short. That honesty, about methods, about limitations, and about what we actually know versus what we hope, is what separates deployed systems from impressive papers.