InventoryBench
A 1,320‑instance benchmark for stress‑testing OR and LLM‑based agents under non‑stationary demand and uncertain lead times.
Benchmark for OR‑ and LLM‑based inventory agents on synthetic and real trajectories.
Why this benchmark?
Classical inventory algorithms from operations research perform well when demand and lead times are stable and correctly modeled. In many real systems, however, demand shifts over time, seasonality appears, and supply is unreliable, so standard predict‑then‑optimize pipelines can break.
Recent large language models can reason with rich contextual descriptions and adapt to regime changes, but they are not a drop‑in replacement for well‑understood OR heuristics. Practitioners need a controlled way to study when LLM agents help, when they hurt, and how they should be combined with traditional tools.
This benchmark provides a common testbed for multi‑period inventory control with challenging demand and lead‑time patterns, enabling apples‑to‑apples comparisons between OR baselines, LLM agents, and hybrid methods.
Benchmark design
Each benchmark instance is a finite‑horizon inventory game defined by a fixed demand and arrival trajectory, with profit from sales and holding costs on leftover stock (a simulation sketch follows the list below).
- Synthetic trajectories (720 instances): 10 demand patterns × 4 variants, with shocks, trends, variance changes, seasonality, temporary spikes/dips, and autocorrelated AR(1) behavior.
- Real trajectories (600 instances): weekly sales for 200 retail products, each combined with three lead‑time settings and cost structures.
- Lead times and lost orders: immediate delivery, fixed‑delay delivery, and stochastic lead times where orders may be delayed or never arrive.
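The per‑period dynamics can be summarized by a short simulation loop. The sketch below is only illustrative: the function name `simulate_episode`, the `arrivals_fn` hook, and the `price`/`holding_cost` parameters are assumptions rather than the benchmark's evaluation code, but the loop mirrors the game described above: arriving orders are received, sales are capped by on‑hand stock, leftover stock incurs a holding cost, and orders under stochastic lead times may be delayed or never arrive.

```python
def simulate_episode(policy, demand, arrivals_fn, price, holding_cost):
    """Illustrative finite-horizon inventory game (not the benchmark's own code).

    arrivals_fn(order_qty, t) returns the period in which an order placed at t
    arrives, or None if the order is lost; it stands in for the immediate,
    fixed-delay, and stochastic lead-time settings.
    """
    on_hand = 0.0
    pipeline = {}                                   # arrival period -> units due
    profit = 0.0
    for t, d in enumerate(demand):
        on_hand += pipeline.pop(t, 0.0)             # receive orders arriving now
        order_qty = policy.order_quantity(t, on_hand)
        arrive_at = arrivals_fn(order_qty, t)
        if order_qty > 0 and arrive_at is not None:  # lost orders never arrive
            pipeline[arrive_at] = pipeline.get(arrive_at, 0.0) + order_qty
        sales = min(on_hand, d)                     # unmet demand is lost
        on_hand -= sales
        profit += price * sales - holding_cost * on_hand
    return profit
```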
Benchmark at a glance
- Synthetic and real inventory trajectories for controlled experiments.
- Synthetic stress tests and real‑data demand from retail products.
- Immediate, fixed, and stochastic lead times, including lost orders.
- OR baselines, LLM agents, OR–LLM hybrids, and beyond.
Submit Your Policy
Have your own inventory control method? Evaluate it on InventoryBench and submit to the leaderboard.
- Implement your policy following the InventoryPolicy interface (a sketch of one possible policy follows this list)
- Run your policy on all 1,320 instances using the provided evaluation scripts
- Evaluate your results to compute normalized scores and generate scores.json
- Package your submission with results, scores, and method description
- Submit via GitHub PR or email
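As a starting point, the sketch below implements a simple order‑up‑to (base‑stock) heuristic. The class and its method names (`reset`, `order_quantity`) are hypothetical placeholders; the actual InventoryPolicy interface and its signatures are specified in the Evaluation Guide.

```python
import statistics

class BaseStockPolicy:
    """Hypothetical example policy; method names and signatures are placeholders,
    not the benchmark's InventoryPolicy interface (see the Evaluation Guide)."""

    def __init__(self, safety_factor=1.0):
        self.safety_factor = safety_factor
        self.base_stock = 0.0

    def reset(self, historical_demand):
        # Fit an order-up-to level from the historical observations (train.csv).
        mean = statistics.mean(historical_demand)
        std = statistics.pstdev(historical_demand)
        self.base_stock = mean + self.safety_factor * std

    def order_quantity(self, period, inventory_position):
        # Order just enough to bring the inventory position back to base stock.
        return max(0.0, self.base_stock - inventory_position)


# Example usage with made-up historical demand values:
policy = BaseStockPolicy(safety_factor=0.5)
policy.reset([12, 9, 15, 11, 13])
print(policy.order_quantity(0, 4.0))
```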
For complete instructions, see the Evaluation Guide in the repository; it covers the policy interface, evaluation framework, file formats, and detailed submission guidelines.
Dataset overview
The benchmark is organized as a collection of instance directories. Each directory contains two CSV files:
- train.csv: five historical demand observations per instance.
- test.csv: the demand trajectory and associated lead times and cost parameters used for evaluation.
See the dataset page for full details on directory layout and file formats.
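For orientation, a minimal loading sketch (assuming pandas; the instance directory name below is a placeholder) might look like this:

```python
from pathlib import Path
import pandas as pd

# Placeholder path; see the dataset page for the real directory layout.
instance_dir = Path("instances/example_instance")

train = pd.read_csv(instance_dir / "train.csv")  # five historical demand observations
test = pd.read_csv(instance_dir / "test.csv")    # evaluation demand, lead times, costs

print(train.shape, list(train.columns))
print(test.shape, list(test.columns))
```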