- Ollama now supports multimodal models through its new engine, starting with vision models such as Llama 4 Scout and Gemma 3.
- Llama 4 Scout is a 109 billion parameter model capable of answering location-based questions about video frames.
- Gemma 3 can analyze multiple images at once and identify common elements, such as animals appearing in all images.
- Qwen 2.5 VL is used for document scanning and character recognition, including translating Chinese spring couplets to English.
- Ollama's new engine improves the reliability and accuracy of local inference, and lays the groundwork for future modalities such as speech, image generation, and video generation.
- Model modularity ensures each model is self-contained, simplifying integration for creators and developers.
- Accuracy improvements include correctly handling large images and preserving the positional information of image tokens during processing.
- Memory management improvements include image caching and optimizations that reduce memory usage during inference.
- Ollama collaborates with hardware manufacturers to optimize inference on various devices.
- Future goals include supporting longer context sizes, reasoning, tool calling with streaming responses, and enabling computer use.
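The vision workflow described above can be sketched as a request to Ollama's local REST API (`POST /api/chat` on the default port 11434), which accepts base64-encoded images in a message's `images` field. This is a minimal sketch: the model name, prompt, and placeholder image bytes below are illustrative, not taken from the source.

```python
import base64
import json

def build_chat_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for Ollama's POST /api/chat endpoint.

    Vision models accept base64-encoded images in a message's
    `images` list alongside the text prompt.
    """
    payload = {
        "model": model,  # e.g. a vision model pulled locally
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
        "stream": False,  # ask for a single complete response
    }
    return json.dumps(payload)

# Placeholder bytes stand in for a real image file; in practice you would
# read a photo with open("photo.png", "rb").read() and POST the body to
# http://localhost:11434/api/chat with any HTTP client.
fake_image = bytes.fromhex("89504e470d0a1a0a")  # PNG magic bytes only
body = build_chat_request("gemma3", "What animal appears in this image?", fake_image)
```

Sending the same structure with several entries in the `images` list is how a model like Gemma 3 can be asked to compare multiple images in one prompt.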