Want to analyse images, read documents, or describe screenshots locally? Multimodal vision models in Ollama let you do all of this without sending images to the cloud. Here are the best Ollama vision models in 2026.
What Are Vision Models?
Vision models, also known as multimodal models or VLMs (vision-language models), accept both text and images as input. You can send them a photo, screenshot, chart, or document and ask questions about it. All processing happens locally on your machine.
Top Ollama Vision Models
1. LLaVA 1.6 (34B) — Best Quality
LLaVA 1.6 in its 34B variant delivers the highest quality image understanding of any model available in Ollama. It accurately describes scenes, reads text in images, interprets charts, and answers detailed questions about visual content.
ollama run llava:34b
Best for: Detailed image analysis, document reading
RAM required: 24GB minimum
2. LLaVA 1.6 (7B) — Best Balance of Quality and Speed
The 7B variant of LLaVA 1.6 is the most popular vision model on Ollama for good reason. It handles most image tasks well and runs on consumer hardware. Ideal for general-purpose visual question answering.
ollama run llava:7b
Best for: General image tasks, everyday use
RAM required: 8GB minimum
3. Moondream — Best for Low-Resource Machines
Moondream is a tiny but capable vision model designed specifically for edge devices and machines with limited resources. It’s remarkably fast and handles basic image description and question answering well despite its small size.
ollama run moondream
Best for: Low-spec machines, simple image tasks
RAM required: 4GB minimum
4. LLaVA-Phi3 — Best for Speed
LLaVA-Phi3 combines Microsoft’s efficient Phi-3 architecture with LLaVA’s vision capabilities. The result is a fast, capable vision model that responds quickly while maintaining decent accuracy on most image tasks.
ollama run llava-phi3
Best for: Speed-sensitive applications
RAM required: 6GB minimum
5. BakLLaVA — Best for OCR Tasks
BakLLaVA is particularly strong at reading text within images. If your primary use case is extracting text from screenshots, photos of documents, or handwritten notes, BakLLaVA performs well above average.
ollama run bakllava
Best for: OCR, reading text in images
RAM required: 8GB minimum
Quick Comparison
| Model | Quality | Speed | RAM | Best Use |
|---|---|---|---|---|
| LLaVA 1.6 34B | Excellent | Slow | 24GB | Detailed analysis |
| LLaVA 1.6 7B | Very Good | Fast | 8GB | General use |
| Moondream | Good | Very Fast | 4GB | Low-spec machines |
| LLaVA-Phi3 | Good | Very Fast | 6GB | Speed priority |
| BakLLaVA | Good | Fast | 8GB | OCR/text reading |
How to Use Vision Models in Ollama
You can pass an image to a vision model directly from the command line by including its file path in the prompt:
ollama run llava "Describe this image: /path/to/image.jpg"
Or via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["<base64-encoded-image>"]
}'
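If you are calling the API from a script, the same request can be built in a few lines of Python using only the standard library. The sketch below is one way to do it, not the only one: the helper names `encode_image` and `ask_about_image` are our own, and it assumes the default Ollama endpoint shown above with the `llava` model already pulled. Setting `stream` to false asks the server for a single complete JSON object rather than a stream of partial responses.

```python
import base64
import json
import urllib.request

def encode_image(data: bytes) -> str:
    """Base64-encode raw image bytes, the format the `images` field expects."""
    return base64.b64encode(data).decode("ascii")

def ask_about_image(path: str, prompt: str, model: str = "llava") -> str:
    """POST an image question to a local Ollama server and return the answer text."""
    with open(path, "rb") as f:
        img_b64 = encode_image(f.read())
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one complete response instead of streamed chunks
        "images": [img_b64],
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

For example, `ask_about_image("chart.png", "Summarize this chart")` returns the model's description as a plain string.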
Our Recommendation
LLaVA 1.6 7B is the best starting point for most users — it runs on a typical gaming PC or workstation and handles the majority of vision tasks well. If you’re on limited hardware, Moondream is your best option. For maximum quality, go with LLaVA 1.6 34B.
For more on running multimodal models, see our guide to using multimodal vision models with Ollama.