Two academic benchmarks reveal a contrast in GPT-5.5’s performance: the model is strong at isolated command-line operations but weaker at extended, multi-step software engineering. Terminal-Bench 2.0 shows the model ...