Large language models struggle to reliably detect self-contradictions in long documents; GPT-4 performs best among the evaluated models.