InventoryBench
A 1,320‑instance benchmark for stress‑testing OR and LLM‑based agents under non‑stationary demand and uncertain lead times.
Benchmark for OR‑ and LLM‑based inventory agents on synthetic and real trajectories.
Why this benchmark?
Classical inventory algorithms from operations research perform well when demand and lead times are stable and correctly modeled. In many real systems, however, demand shifts over time, seasonality appears, and supply is unreliable, so standard predict‑then‑optimize pipelines can break.
Recent large language models can reason with rich contextual descriptions and adapt to regime changes, but they are not a drop‑in replacement for well‑understood OR heuristics. Practitioners need a controlled way to study when LLM agents help, when they hurt, and how they should be combined with traditional tools.
This benchmark provides a common testbed for multi‑period inventory control with challenging demand and lead‑time patterns, enabling apples‑to‑apples comparisons between OR baselines, LLM agents, and hybrid methods.
Benchmark design
Each benchmark instance is a finite‑horizon inventory game defined by a fixed demand and arrival trajectory, with profit from sales and holding costs on leftover stock (a simulation sketch follows the list below).
- Synthetic trajectories (720 instances): 10 demand patterns × 4 variants, with shocks, trends, variance changes, seasonality, temporary spikes/dips, and autocorrelated AR(1) behavior.
- Real trajectories (600 instances): weekly sales for 200 retail products, each combined with three lead‑time settings and cost structures.
- Lead times and lost orders: immediate delivery, fixed‑delay delivery, and stochastic lead times where orders may be delayed or never arrive.
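The per‑period dynamics can be summarized by a short simulation loop. The sketch below is only illustrative: the function name `simulate_episode`, the `arrivals_fn` hook, and the `price`/`holding_cost` parameters are assumptions rather than the benchmark's evaluation code, but the loop mirrors the game described above: arriving orders are received, sales are capped by on‑hand stock, leftover stock incurs a holding cost, and orders under stochastic lead times may be delayed or never arrive.

```python
def simulate_episode(policy, demand, arrivals_fn, price, holding_cost):
    """Illustrative finite-horizon inventory game (not the benchmark's own code).

    arrivals_fn(order_qty, t) returns the period in which an order placed at t
    arrives, or None if the order is lost; it stands in for the immediate,
    fixed-delay, and stochastic lead-time settings.
    """
    on_hand = 0.0
    pipeline = {}                                   # arrival period -> units due
    profit = 0.0
    for t, d in enumerate(demand):
        on_hand += pipeline.pop(t, 0.0)             # receive orders arriving now
        order_qty = policy.order_quantity(t, on_hand)
        arrive_at = arrivals_fn(order_qty, t)
        if order_qty > 0 and arrive_at is not None:  # lost orders never arrive
            pipeline[arrive_at] = pipeline.get(arrive_at, 0.0) + order_qty
        sales = min(on_hand, d)                     # unmet demand is lost
        on_hand -= sales
        profit += price * sales - holding_cost * on_hand
    return profit
```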
Benchmark at a glance
- Synthetic and real inventory trajectories for controlled experiments.
- Synthetic stress tests and real‑data demand from retail products.
- Immediate, fixed, and stochastic lead times, including lost orders.
- OR baselines, LLM agents, OR–LLM hybrids, and beyond.
Submit Your Policy
Have your own inventory control method? Evaluate it on InventoryBench and submit to the leaderboard.
- Implement your policy following the InventoryPolicy interface (a sketch of one possible policy follows this list)
- Run your policy on all 1,320 instances using the provided evaluation scripts
- Evaluate your results to compute normalized scores and generate scores.json
- Package your submission with results, scores, and method description
- Submit via GitHub PR or email
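As a starting point, the sketch below implements a simple order‑up‑to (base‑stock) heuristic. The class and its method names (`reset`, `order_quantity`) are hypothetical placeholders; the actual InventoryPolicy interface and its signatures are specified in the Evaluation Guide.

```python
import statistics

class BaseStockPolicy:
    """Hypothetical example policy; method names and signatures are placeholders,
    not the benchmark's InventoryPolicy interface (see the Evaluation Guide)."""

    def __init__(self, safety_factor=1.0):
        self.safety_factor = safety_factor
        self.base_stock = 0.0

    def reset(self, historical_demand):
        # Fit an order-up-to level from the historical observations (train.csv).
        mean = statistics.mean(historical_demand)
        std = statistics.pstdev(historical_demand)
        self.base_stock = mean + self.safety_factor * std

    def order_quantity(self, period, inventory_position):
        # Order just enough to bring the inventory position back to base stock.
        return max(0.0, self.base_stock - inventory_position)


# Example usage with made-up historical demand values:
policy = BaseStockPolicy(safety_factor=0.5)
policy.reset([12, 9, 15, 11, 13])
print(policy.order_quantity(0, 4.0))
```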
For complete instructions, see the Evaluation Guide in the repository; it covers the policy interface, evaluation framework, file formats, and detailed submission guidelines.
Dataset overview
The benchmark is organized as a collection of instance directories. Each directory contains two CSV files:
- train.csv: five historical demand observations per instance.
- test.csv: the demand trajectory and associated lead times and cost parameters used for evaluation.
See the dataset page for full details on directory layout and file formats.
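For orientation, a minimal loading sketch (assuming pandas; the instance directory name below is a placeholder) might look like this:

```python
from pathlib import Path
import pandas as pd

# Placeholder path; see the dataset page for the real directory layout.
instance_dir = Path("instances/example_instance")

train = pd.read_csv(instance_dir / "train.csv")  # five historical demand observations
test = pd.read_csv(instance_dir / "test.csv")    # evaluation demand, lead times, costs

print(train.shape, list(train.columns))
print(test.shape, list(test.columns))
```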