# Leaderboard
This page summarizes performance of OR baselines, LLM policies, and OR–LLM hybrids on InventoryBench.
The primary evaluation metric is cumulative reward over the test period. We also report normalized reward, defined as the ratio between actual reward and a perfect‑foresight upper bound.
## Overall leaderboard
The table below ranks all evaluated combinations of LLM and decision method by average normalized reward across all 1,320 benchmark instances. Higher is better, and 1.0 corresponds to the perfect-foresight upper bound. The per-column scores break the overall average down by dataset family (real vs. synthetic trajectories) and lead-time setting (0, 4, or stochastic).
| Rank | Agent | Overall Score | Real LT=0 | Real LT=4 | Real Stoch | Synth LT=0 | Synth LT=4 | Synth Stoch | Details |
|---|---|---|---|---|---|---|---|---|---|
## Metric definition
For each instance, cumulative reward is:
Reward = Σ_t (profit_t × units_sold_t − holding_cost_t × units_held_t)
where:
- `units_sold_t = min(demand_t, available_inventory_t)`
- `units_held_t` is the end-of-period on-hand inventory.
The perfect-foresight upper bound assumes perfect knowledge of future demand and unlimited supply:
PerfectScore = Σ_t (demand_t × profit_t)
We report normalized reward as:
NormalizedReward = max(0, Reward / PerfectScore)
This normalizes performance across instances with different demand scales and cost parameters.
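As a concrete illustration of these definitions, here is a minimal Python sketch that computes cumulative reward, the perfect-foresight upper bound, and normalized reward for a single instance. The function name and argument layout are illustrative only and are not the benchmark's evaluation API.

```python
# Illustrative computation of the metrics above; names are not part of the benchmark API.

def evaluate_instance(demand, profit, holding_cost, available_inventory):
    """Compute cumulative, perfect-foresight, and normalized reward.

    All arguments are equal-length sequences indexed by period t:
    - demand[t]: realized demand
    - profit[t]: per-unit profit
    - holding_cost[t]: per-unit holding cost
    - available_inventory[t]: on-hand inventory available to serve demand in period t
    """
    reward = 0.0
    perfect = 0.0
    for d, p, h, inv in zip(demand, profit, holding_cost, available_inventory):
        units_sold = min(d, inv)        # sales are capped by available inventory
        units_held = inv - units_sold   # end-of-period on-hand inventory
        reward += p * units_sold - h * units_held
        perfect += p * d                # perfect foresight: every unit of demand is sold
    normalized = max(0.0, reward / perfect) if perfect > 0 else 0.0
    return reward, perfect, normalized


# Example with a three-period instance (made-up numbers).
print(evaluate_instance(
    demand=[10, 5, 8],
    profit=[2.0, 2.0, 2.0],
    holding_cost=[0.5, 0.5, 0.5],
    available_inventory=[12, 4, 8],
))
```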
## Method-level breakdown (per LLM × method)
For each row in the leaderboard, the detail page shows:
- The underlying model name (for example, `google/gemini-3-flash-preview`).
- Average normalized reward by dataset family:
  - `synthetic_trajectory` (720 instances)
  - `real_trajectory` (600 instances)
- Average normalized reward by lead-time setting:
  - `lead_time_0`, `lead_time_4`, `lead_time_stochastic`
In future iterations we plan to extend these pages with per-instance tables and links to the raw `benchmark_results.json` files.
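To illustrate how such per-family and per-lead-time averages could be derived from per-instance results, here is a rough Python sketch. The record fields (`dataset_family`, `lead_time`, `normalized_reward`) are assumptions for illustration and may not match the actual `benchmark_results.json` schema.

```python
# Sketch of the aggregation behind the detail pages; field names are assumed.
import json
from collections import defaultdict

with open("benchmark_results.json") as f:
    records = json.load(f)  # assumed: a list of per-instance result dicts

by_family = defaultdict(list)
by_lead_time = defaultdict(list)
for r in records:
    by_family[r["dataset_family"]].append(r["normalized_reward"])
    by_lead_time[r["lead_time"]].append(r["normalized_reward"])

for name, vals in {**by_family, **by_lead_time}.items():
    print(f"{name}: mean normalized reward = {sum(vals) / len(vals):.3f} (n={len(vals)})")
```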
## Submission Guidelines
We welcome submissions of new methods to the InventoryBench leaderboard! To submit your method:
- Implement your method following the benchmark interface for the 1,320 test instances.
- Generate results, including cumulative rewards and decisions, for all instances across all six trajectory/lead-time categories.
- Calculate metrics, including the mean normalized reward and standard error for each category (see the sketch at the end of this section).
- Create a detailed description of your approach in a `README.md` file.
- Submit your results to the InventoryBench repository via a pull request, including:
  - `results/{Method_Name}/benchmark_results.json`
  - `results/{Method_Name}/scores.json`
  - `results/{Method_Name}/README.md`
Your method will be evaluated and added to the leaderboard upon acceptance. For more details, see the project documentation on GitHub.
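As a rough sketch of the metrics step, the snippet below computes the mean normalized reward and its standard error per category and writes them to a `scores.json` file. The grouping key, the `My_Method` directory name, and the output layout are hypothetical; follow the schema documented in the repository when preparing an actual submission.

```python
# Hypothetical sketch of per-category metrics for a submission; the category key
# and scores.json layout are assumptions, not the required schema.
import json
import statistics
from collections import defaultdict

with open("results/My_Method/benchmark_results.json") as f:  # hypothetical method directory
    records = json.load(f)

groups = defaultdict(list)
for r in records:
    # assumed category key, e.g. "real_trajectory/lead_time_4"
    groups[f'{r["dataset_family"]}/{r["lead_time"]}'].append(r["normalized_reward"])

scores = {}
for category, vals in groups.items():
    mean = statistics.mean(vals)
    # standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(vals) / len(vals) ** 0.5 if len(vals) > 1 else 0.0
    scores[category] = {"mean_normalized_reward": mean, "standard_error": se, "n": len(vals)}

with open("results/My_Method/scores.json", "w") as f:
    json.dump(scores, f, indent=2)
```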