A small benchmark of short natural-language prompts that look easy but expose weak reasoning: goal grounding, world-state tracking, social pragmatics, modified-riddle templates, literal precision ...