The author successfully implements multi-modal auto-regressive modeling with visual words, providing supervision information for visual modeling and enhancing vision-language comprehension.