The author evaluates how well large vision-language models perform on both specialized and general tasks, highlighting their limitations and their potential for improvement.