Deeper transformer language models exhibit better compositional generalization than shallower models, even when controlling for total parameter count.
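To make "controlling for total parameter count" concrete, the sketch below shows one way depth can be traded against width at a fixed parameter budget. It is a minimal illustration, not the source's actual setup: the parameter formula (12 · depth · d_model² per layer plus token embeddings, assuming d_ff = 4 · d_model and ignoring biases and layer norms) and the vocabulary size are assumptions made for the example.

```python
import math

def transformer_params(depth: int, d_model: int, vocab: int = 50_000) -> int:
    """Approximate parameter count for a decoder-only transformer.

    Per layer: 4 * d_model^2 attention weights plus 8 * d_model^2 FFN
    weights (with d_ff = 4 * d_model); token embeddings add vocab * d_model.
    Biases and layer norms are ignored as negligible. All values here are
    illustrative assumptions, not taken from the source.
    """
    per_layer = 12 * d_model ** 2
    return depth * per_layer + vocab * d_model

def width_for_budget(depth: int, budget: int, vocab: int = 50_000) -> int:
    """Solve 12 * depth * w^2 + vocab * w = budget for the width w."""
    a, b, c = 12 * depth, vocab, -budget
    w = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return round(w)

# Build a family of depth-matched configurations at one budget.
budget = transformer_params(depth=6, d_model=1024)  # reference model
for depth in (6, 12, 24, 48):
    w = width_for_budget(depth, budget)
    print(f"depth={depth:3d}  width={w:5d}  params={transformer_params(depth, w):,}")
```

Each configuration printed has (approximately) the same total parameter count, so any difference in compositional generalization across them can be attributed to the depth-width trade-off rather than to model size.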