Large language models struggle with multistep reasoning, prompting the need for a new evaluation dataset like MuSR.