The Bridge Between Visual Perception and Natural Language
A Vision-Language Model (VLM) is an AI model designed to process and understand both images (or video) and text simultaneously. Unlike traditional computer vision models, which are trained to output a fixed set of labels, VLMs can "read" an image and discuss it in natural language.
They work by mapping visual features into the same mathematical space (embeddings) as word tokens: a vision encoder (typically a Vision Transformer) turns the image into a grid of patch features, and a learned projection layer maps those features into the LLM's embedding space, allowing the LLM to "see" the visual data as just another sequence of tokens.
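To make that concrete, here is a minimal PyTorch sketch of the projection step. The dimensions, the random stand-in features, and the single linear projector are illustrative assumptions (LLaVA-style models, for instance, pair a pretrained vision encoder with a small projection module), not any particular model's implementation.

```python
import torch
import torch.nn as nn

d_vision = 768   # width of the vision encoder's patch features (assumed)
d_llm = 4096     # width of the LLM's token embeddings (assumed)

# 1. A vision encoder (e.g., a ViT) turns an image into patch features.
#    Random stand-in here: 256 patches for one image.
patch_features = torch.randn(1, 256, d_vision)

# 2. A learned projection maps patch features into the LLM's embedding space.
projector = nn.Linear(d_vision, d_llm)
image_tokens = projector(patch_features)           # (1, 256, d_llm)

# 3. Text is embedded as usual; the image tokens are spliced into the same
#    sequence, so the LLM attends over both modalities at once.
text_embedding = nn.Embedding(32000, d_llm)        # vocab size assumed
text_ids = torch.tensor([[101, 2054, 2003]])       # placeholder token ids
text_tokens = text_embedding(text_ids)             # (1, 3, d_llm)

llm_input = torch.cat([image_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 259, 4096]): the image is 256 extra tokens
```

In LLaVA-style training, much of the early alignment work falls on this small projector, which learns to place visual features where the language model expects meaningful embeddings.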
Before VLMs, computer vision was task-specific (e.g., one model for faces, another for cars). VLMs are general-purpose: you can give a VLM a photo of an open refrigerator and ask, "What can I cook with these ingredients?", a task that requires both visual identification and culinary reasoning.
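In practice, that interaction is a single prompt that interleaves an image with a question. The sketch below assumes the Hugging Face transformers library and the public llava-hf/llava-1.5-7b-hf checkpoint; the file path, prompt template, and generation settings are illustrative and vary between VLMs.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("fridge.jpg")  # placeholder path
prompt = "USER: <image>\nWhat can I cook with these ingredients? ASSISTANT:"

# The processor tokenizes the text and preprocesses the image together;
# the <image> placeholder marks where visual tokens enter the sequence.
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```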