Fine-tuning the model is crucial for successful cross-modal transfer with ORCA, while excessive embedder training may not always improve performance.