Benchmarking Single Agent Performance
We explore how increasing the number of instructions and tools available to a single ReAct agent affects its performance, benchmarking models like claude-3.5-sonnet, gpt-4o, o1, and o3-mini across two domains of tasks.