Ollama makes it straightforward to run large language models locally, but its capabilities extend well beyond text. A growing number of models available through Ollama can accept image inputs alongside text prompts — a capability referred to as multimodal or vision AI. This guide covers what vision models are, which ones are available via Ollama, and how to use them from the command line, the API, and Python.
What Are Multimodal Vision Models?
A standard large language model works entirely with text: you send a text prompt, it returns a text response. A multimodal model can additionally process one or more images as part of the input. This opens up tasks that are impossible with text-only models:
- Describing the contents of a photo in natural language
- Answering specific questions about what appears in an image
- Extracting and transcribing text visible in a screenshot
- Analysing charts, graphs, and diagrams
- Identifying objects, people, or scenes
- Providing structured data extracted from unstructured image content
Under the hood, these models combine a vision encoder (which converts pixel data into a format the language model can understand) with the language model itself — reasoning across text and images in one pass.
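As a rough illustration of that flow, the toy sketch below (plain Python, no real model; every function and dimension here is invented for demonstration) shows an image being "encoded" into a handful of embedding vectors that are simply prepended to the text-token embeddings, giving the language model one combined sequence to reason over:

```python
# Toy illustration only: real vision encoders are neural networks
# (ViT-style); here an "embedding" is just a short list of floats.

EMBED_DIM = 4  # real models use hundreds or thousands of dimensions

def encode_image(pixels):
    """Stand-in vision encoder: collapse each pixel row into a fixed-size vector."""
    return [[sum(row) / len(row)] * EMBED_DIM for row in pixels]

def embed_text(tokens):
    """Stand-in text embedder: one vector per token."""
    return [[float(len(tok))] * EMBED_DIM for tok in tokens]

def build_model_input(pixels, tokens):
    """Image embeddings are prepended to text embeddings; the language
    model then attends over the combined sequence in one pass."""
    return encode_image(pixels) + embed_text(tokens)

image = [[0.1, 0.2], [0.3, 0.4]]   # a 2x2 "image"
prompt = ["What", "is", "this", "?"]
seq = build_model_input(image, prompt)
print(len(seq))  # 2 image vectors + 4 text vectors = 6
```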
Vision Models Available in Ollama
LLaVA (Large Language and Vision Assistant)
LLaVA is one of the original open-source vision-language models and remains widely used. It is available in three sizes: llava:7b, llava:13b, and llava:34b. The 7B variant runs comfortably on most hardware with 8 GB of RAM; the 34B variant delivers considerably better quality but requires 24 GB or more.
LLaVA-Phi3
Available as llava-phi3, this model pairs the LLaVA vision component with Microsoft’s Phi-3 language model. Smaller and faster than standard LLaVA variants while maintaining good output quality for everyday vision tasks.
Moondream2
The moondream model is exceptionally compact at 1.8 billion parameters, designed specifically for speed and efficiency on constrained hardware. It is the best option for simple image description tasks on low-spec machines; for quick captioning or basic object identification it punches well above its weight.
Llama 3.2 Vision
Meta’s llama3.2-vision:11b is currently the strongest general-purpose vision model available in Ollama. It delivers noticeably better spatial reasoning, OCR accuracy, and chart analysis than the LLaVA family at a comparable size. The recommended starting point if you have 8 GB or more of RAM.
BakLLaVA
Available as bakllava, this model combines the LLaVA vision approach with a Mistral base model — an alternative flavour worth experimenting with if you are already using Mistral-based models in your workflow.
Pulling a Vision Model
ollama pull llava
ollama pull llava:13b
ollama pull llava-phi3
ollama pull moondream
ollama pull llama3.2-vision
Download sizes range from roughly 1.5 GB for Moondream to 8 GB or more for larger LLaVA variants. Once downloaded, the model is cached locally and available immediately.
Using Vision Models from the CLI
Start an interactive session:
ollama run llava
Once the model is loaded, attach an image by including its file path in your message:
What is in this image? /path/to/your/image.jpg
Note: images are not retained between turns. Each message is processed independently. If you want the model to reference the same image in a follow-up, attach it again.
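The same statelessness matters when you drive the model programmatically: include the image in every user message that should reference it. A minimal sketch of such a message history (plain dicts in the shape the Python client accepts; the path and replies are hypothetical):

```python
# Each user turn that should "see" the image carries it explicitly;
# an image attached in an earlier turn is not implicitly available later.
image_path = '/path/to/photo.jpg'  # hypothetical path

messages = [
    {'role': 'user', 'content': 'What is in this image?',
     'images': [image_path]},
    {'role': 'assistant', 'content': 'A dog sitting on a beach.'},
    # Follow-up about the same image: attach it again.
    {'role': 'user', 'content': 'What breed is the dog?',
     'images': [image_path]},
]

print(sum('images' in m for m in messages))  # 2 user turns carry the image
```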
Using Vision Models via the Ollama API
To send an image via the API, encode it as a Base64 string and include it in the images array of the request body.
Encoding an Image to Base64
base64 -i image.jpg
On macOS, base64 -i reads the named file. With GNU coreutils on Linux, use base64 -w 0 image.jpg instead: the -w 0 flag disables line wrapping, which would otherwise break the JSON payload.
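If you are scripting the encoding from Python instead, the standard library's base64 module does the same job portably. The bytes below are a placeholder standing in for real image data; the round trip at the end just demonstrates that the encoding is lossless:

```python
import base64

# In a real script you would read your actual image file:
#   with open('image.jpg', 'rb') as f:
#       raw = f.read()
raw = b'\x89PNG...fake image bytes...'  # placeholder for demonstration

encoded = base64.b64encode(raw).decode('ascii')  # JSON-safe string
print(encoded[:16])

# Decoding restores the original bytes exactly
assert base64.b64decode(encoded) == raw
```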
Full curl Example
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava",
    "prompt": "Describe what you see in this image.",
    "images": ["'$(base64 -i image.jpg)'"],
    "stream": false
  }'
Ollama handles image resizing internally. Supported formats: JPEG, PNG, and WebP.
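The request body the curl command sends can also be assembled with the standard json module, which is less error-prone than shell quoting. This sketch only constructs and prints the payload (the base64 string is a placeholder); sending it is then a single POST to the /api/generate endpoint shown above:

```python
import json

payload = {
    'model': 'llava',
    'prompt': 'Describe what you see in this image.',
    'images': ['<base64-encoded image data>'],  # placeholder string
    'stream': False,  # serialised as JSON false
}

body = json.dumps(payload)
print(body)
```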
Using Vision Models with Python
pip install ollama
Pass an Image by File Path
import ollama

response = ollama.chat(
    model='llama3.2-vision',
    messages=[
        {
            'role': 'user',
            'content': 'What is shown in this image? Describe it in detail.',
            'images': ['/path/to/your/image.jpg']
        }
    ]
)

print(response['message']['content'])
Pass a Base64-Encoded Image
import ollama
import base64

with open('screenshot.png', 'rb') as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

response = ollama.chat(
    model='llava',
    messages=[
        {
            'role': 'user',
            'content': 'Extract all text visible in this screenshot.',
            'images': [image_data]
        }
    ]
)

print(response['message']['content'])
Practical Use Cases
- Screenshot OCR: Extract text from screenshots, scanned documents, or photos of printed text without sending data to an external service
- Chart and diagram analysis: Summarise trends in a chart or explain the structure of a diagram
- Product image descriptions: Generate alt text or product descriptions from photographs automatically
- Object identification: Ask what objects, animals, or landmarks appear in a photo
- Image-based Q&A: Feed an image of a whiteboard or handwritten notes and ask specific questions about the content
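For the structured-data use case, a common pattern is to prompt the model to answer in JSON and then parse the reply defensively, since models sometimes wrap JSON in prose or code fences. The helper below is a hypothetical sketch of that parsing step (extract_json is not part of the ollama library, and the sample reply is made up):

```python
import json
import re

def extract_json(reply: str):
    """Pull the first JSON object out of a model reply, tolerating
    surrounding prose or ```json fences. Returns None if nothing parses."""
    match = re.search(r'\{.*\}', reply, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Made-up model reply for demonstration
reply = 'Sure! Here is the data:\n```json\n{"items": 3, "total": "9.50"}\n```'
print(extract_json(reply))  # {'items': 3, 'total': '9.50'}
```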
Using Vision Models in Open WebUI
If you are running Open WebUI alongside Ollama, vision model support is built in. Select a vision-capable model from the model dropdown, then drag and drop an image directly into the chat input area before sending your message. Open WebUI handles the encoding and API call automatically — a practical option for non-technical users who want local vision AI without writing any code.
Limitations and Honest Expectations
- Speed: Vision models are generally slower than text-only models of comparable size, because the image encoding step adds processing overhead
- Quality ceiling: Open-weight 7B–11B vision models do not match GPT-4V or Claude 3 on complex reasoning tasks — capable for many practical tasks but will occasionally miss fine-grained details
- Stateless images: Images are not retained between messages — re-attach if you need continuity across turns
- Text-dense images: OCR accuracy on low-resolution or stylised text can be inconsistent; results improve with higher-resolution input
Choosing the Right Model for Your Hardware
- 8 GB RAM or more: Use llama3.2-vision:11b for the best general-purpose results
- 8 GB RAM, want faster responses: llava:7b or llava-phi3 — a balance between quality and speed
- 4 GB RAM or constrained hardware: Use moondream — runs quickly even on machines that struggle with larger models
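If you are scripting model selection, the rules of thumb above collapse into a small helper. pick_vision_model is purely illustrative, and its thresholds simply mirror the guidance in this section:

```python
def pick_vision_model(ram_gb: float, prefer_speed: bool = False) -> str:
    """Map available RAM (and a speed preference) to an Ollama model tag,
    following the rules of thumb above. Illustrative only."""
    if ram_gb < 8:
        return 'moondream'      # constrained hardware
    if prefer_speed:
        return 'llava:7b'       # faster responses at 8 GB
    return 'llama3.2-vision:11b'  # best general-purpose results

print(pick_vision_model(16))                    # llama3.2-vision:11b
print(pick_vision_model(8, prefer_speed=True))  # llava:7b
print(pick_vision_model(4))                     # moondream
```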