mike · 2w
Vending-Bench: Testing long-term coherence in agents
https://andonlabs.com/evals/vending-bench
Interesting. From the abstract of the linked paper:
> Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover.