A unified framework that leverages both visual and verbal references to improve target perception and discrimination in natural language tracking.